jump-cellpainting / datasets

Images and other data from the JUMP Cell Painting Consortium
BSD 3-Clause "New" or "Revised" License

Missing cell locations parquet, source 5, plate ATSJUM206 #102

Arkkienkeli closed 3 months ago

Arkkienkeli commented 4 months ago

The cell locations file that should exist at this URL

s3://cellpainting-gallery/cpg0016-jump/source_5/workspace/load_data_csv/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/load_data_with_illum_and_cell_location.parquet

does not exist.
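For anyone who wants to check other plates for the same gap, here is a minimal sketch. The path layout is copied from the URI above; the helper names `cell_location_uri` and `object_exists` are hypothetical, and the use of `--no-sign-request` assumes the bucket allows anonymous reads.

```shell
#!/bin/sh
# Build the expected S3 URI of the cell locations Parquet file for one plate.
# Arguments: source number, batch ID, plate ID.
# The layout mirrors the URI quoted in this issue.
cell_location_uri() {
  printf 's3://cellpainting-gallery/cpg0016-jump/source_%s/workspace/load_data_csv/%s/%s/load_data_with_illum_and_cell_location.parquet\n' "$1" "$2" "$3"
}

# Return 0 if `aws s3 ls` lists something at the URI, non-zero otherwise.
# Note: `aws s3 ls` matches by prefix, so this is a quick check, not a proof.
# Assumption: anonymous access works for this bucket (--no-sign-request).
object_exists() {
  aws s3 ls --no-sign-request "$1" >/dev/null 2>&1
}

uri="$(cell_location_uri 5 JUMPCPE-20210730-Run14_20210731_000211 ATSJUM206)"
echo "$uri"
```

After sourcing this, `object_exists "$uri" && echo present || echo missing` reports whether the file is there.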


shntnu commented 3 months ago

It turns out we had reported this issue internally, but it was never resolved (ref: https://github.com/jump-cellpainting/aws/issues/75#issuecomment-1531518014). Thankfully, the analysis files are available, so we can regenerate the missing file.

Here are the steps to follow

https://cytomining.github.io/profiling-handbook/05-create-profiles.html#create-database-backend

PROJECT_NAME=cpg0016-jump

mkdir -p ~/ebs_tmp/${PROJECT_NAME}/workspace/software

cd ~/ebs_tmp/${PROJECT_NAME}/workspace/software

if [ -d pycytominer ]; then rm -rf pycytominer; fi

git clone https://github.com/cytomining/pycytominer.git

cd pycytominer

python3 -m pip install -e ".[collate]"

and then

BATCH_ID="JUMPCPE-20210730-Run14_20210731_000211"
PLATE="ATSJUM206"
python3 pycytominer/cyto_utils/collate_cmd.py ${BATCH_ID} pycytominer/cyto_utils/database_config/ingest_config.ini ${PLATE} \
--tmp-dir ~/ebs_tmp \
--aws-remote=s3://cellpainting-gallery/cpg0016-jump/source_5/workspace
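If the same regeneration is needed for another plate, the command above can be templated. The sketch below only prints the command so it can be reviewed before running. The script path, config path, and flags are copied from the command above; `print_collate_cmd` and its third argument (the source directory) are hypothetical additions for illustration.

```shell
#!/bin/sh
# Print (do not run) the collate command for one batch, plate, and source.
# Everything except the three substituted values is taken verbatim from
# the command in this comment; no extra flags are assumed.
print_collate_cmd() {
  batch_id="$1"
  plate="$2"
  src="$3"   # e.g. source_5 (hypothetical parameter)
  printf 'python3 pycytominer/cyto_utils/collate_cmd.py %s pycytominer/cyto_utils/database_config/ingest_config.ini %s --tmp-dir ~/ebs_tmp --aws-remote=s3://cellpainting-gallery/cpg0016-jump/%s/workspace\n' \
    "$batch_id" "$plate" "$src"
}

print_collate_cmd "JUMPCPE-20210730-Run14_20210731_000211" "ATSJUM206" "source_5"
```

The printed line is meant to be run from inside the pycytominer checkout, as in the steps above.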

I'll chat with @ashah03 about this and he will loop back


Done

Downloading CSVs from s3://cellpainting-gallery/cpg0016-jump/source_5/workspace/analysis/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/analysis to ../../analysis/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/analysis
Ingesting ../../analysis/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/analysis
Indexing database /home/ec2-user/ebs_tmp/backend/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/ATSJUM206.sqlite
Uploading /home/ec2-user/ebs_tmp/backend/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/ATSJUM206.sqlite to s3://cellpainting-gallery/cpg0016-jump/source_5/workspace/backend/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/ATSJUM206.sqlite
Removing analysis files from ../../analysis/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/analysis and /home/ec2-user/ebs_tmp/backend/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206
Renaming /home/ec2-user/ebs_tmp/backend/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/ATSJUM206.sqlite to ../../backend/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/ATSJUM206.sqlite
Aggregating sqlite:///../../backend/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/ATSJUM206.sqlite
Uploading ../../backend/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/ATSJUM206.csv to s3://cellpainting-gallery/cpg0016-jump/source_5/workspace/backend/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/ATSJUM206.csv
Removing backend files from ../../backend/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206

shntnu commented 3 months ago

@ashah03 -- ATSJUM206.sqlite is ready

aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_5/workspace/backend/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/
2022-10-21 00:42:32          0 
2024-03-14 05:17:09   54929099 ATSJUM206.csv
2024-03-14 04:22:44 44989165568 ATSJUM206.sqlite

shntnu commented 3 months ago

Thanks a lot @ashah03!

@Arkkienkeli – all set here

aws s3 cp s3://staging-cellpainting-gallery/cpg0016-jump/source_5/workspace/load_data_csv/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/load_data_with_illum_and_cell_location.parquet s3://cellpainting-gallery/cpg0016-jump/source_5/workspace/load_data_csv/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/load_data_with_illum_and_cell_location.parquet

aws s3 ls s3://staging-cellpainting-gallery/cpg0016-jump/source_5/workspace/load_data_csv/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/load_data_with_illum_and_cell_location.parquet
2024-03-14 19:22:48   14826148 load_data_with_illum_and_cell_location.parquet
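The staging-to-production copy above follows a simple pattern: same key, different bucket. Here is a rough sketch of that mapping. The bucket names come from the commands above; `production_uri` and `promote_dry_run` are hypothetical helpers that only print commands. Unlike the listing above, the printed `aws s3 ls` checks the destination rather than the staging copy.

```shell
#!/bin/sh
# Map a staging URI to the same key in the production bucket.
# Fails (non-zero) if the input is not in the staging bucket.
production_uri() {
  case "$1" in
    s3://staging-cellpainting-gallery/*)
      printf 's3://cellpainting-gallery/%s\n' "${1#s3://staging-cellpainting-gallery/}"
      ;;
    *)
      echo "not a staging URI: $1" >&2
      return 1
      ;;
  esac
}

# Print the copy command and a follow-up listing of the destination.
# Nothing is executed against S3; review the output, then run it by hand.
promote_dry_run() {
  dest="$(production_uri "$1")" || return 1
  printf 'aws s3 cp %s %s\n' "$1" "$dest"
  printf 'aws s3 ls %s\n' "$dest"
}

promote_dry_run "s3://staging-cellpainting-gallery/cpg0016-jump/source_5/workspace/load_data_csv/JUMPCPE-20210730-Run14_20210731_000211/ATSJUM206/load_data_with_illum_and_cell_location.parquet"
```

Running the printed commands needs credentials that can read the staging bucket and write to the production one.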

ashah03 commented 3 months ago

@shntnu can we move this file to s3://cellpainting-gallery (not staging) since everything else is there?

shntnu commented 3 months ago

> @shntnu can we move this file to s3://cellpainting-gallery (not staging) since everything else is there?

Already done in https://github.com/jump-cellpainting/datasets/issues/102#issuecomment-1999453068

shntnu commented 3 months ago

From @ashah03


We are unable to create the cell locations files for some plates because we run out of memory. We don't know whether this is caused by the SQLite file or the load_data CSV file.

SQLite file I generated for which cell locations do NOT get generated:

s3://staging-cellpainting-gallery/cpg0016-jump/source_3/workspace/backend/CP_25_all_Phenix1/C13443aW/C13443aW.sqlite

Corresponding load data Parquet file:

s3://cellpainting-gallery/cpg0016-jump/source_3/workspace/load_data_csv/CP_25_all_Phenix1/C13443aW/load_data_with_illum.parquet

SQLite file for which cell locations do get generated:

s3://cellpainting-gallery/cpg0016-jump/source_3/workspace/backend/CP60/BR5872b3/BR5872b3.sqlite

Corresponding load data Parquet file:

s3://cellpainting-gallery/cpg0016-jump/source_3/workspace/load_data_csv/CP60/BR5872b3/load_data_with_illum.parquet
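Since it is not yet clear which input drives the memory use, one cheap first check (a suggestion, not something tried in this thread) is to compare object sizes for the failing and working plates. The sketch below turns the byte column of `aws s3 ls` output into GiB. `ls_sizes_gib` is a hypothetical helper, and it assumes object names without spaces.

```shell
#!/bin/sh
# Read `aws s3 ls` lines on stdin (date, time, size in bytes, name) and
# print "<name> <size> GiB" per object. Lines with fewer than four fields,
# such as the empty-name directory marker, are skipped.
ls_sizes_gib() {
  awk 'NF >= 4 { printf "%s %.2f GiB\n", $4, $3 / (1024 * 1024 * 1024) }'
}

# Example using a listing line quoted earlier in this thread.
echo "2024-03-14 04:22:44 44989165568 ATSJUM206.sqlite" | ls_sizes_gib
# prints: ATSJUM206.sqlite 41.90 GiB
```

For a real comparison, pipe a listing into it, for example `aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_3/workspace/backend/CP60/BR5872b3/ | ls_sizes_gib`, and do the same for the failing plate (the staging bucket will likely need credentials).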

shntnu commented 3 months ago

@ashah03 to wrap this up, can you please do the following

We can return to this later so that you can keep moving for now (i.e. you are off the hook :D)

@Arkkienkeli - we seem to have hit a wall but there has to be a fix. I'll ask Cimini lab if they have any ideas on what could be wrong with the SQLite (if indeed it is the SQLite)

ashah03 commented 3 months ago

@shntnu moving to https://github.com/jump-cellpainting/datasets-private/issues/71 since this issue (source 5) is resolved