Branching from #104 and the chats with @niranjchandrasekaran, @shntnu and @johnarevalo, we came up with some steps to ensure that any ID we query can be associated with a set of images.
To ensure robustness of all datasets:
[ ] Add the CPG id to the PLATE and WELL metadata tables so that each well has a unique path.
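A minimal sketch of the check this task would enable. The column names (`Metadata_CPG`, `Metadata_Plate`, `Metadata_Well`) and the rows are hypothetical and only for illustration; the real tables may name these fields differently. It reports any (CPG id, plate, well) combination that appears more than once:

```python
from collections import Counter


def duplicated_well_paths(rows):
    """Return the (cpg, plate, well) keys that occur more than once."""
    counts = Counter(
        (r["Metadata_CPG"], r["Metadata_Plate"], r["Metadata_Well"]) for r in rows
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Illustrative rows only (hypothetical ids, not real metadata).
wells = [
    {"Metadata_CPG": "cpg-example-a", "Metadata_Plate": "P1", "Metadata_Well": "A01"},
    {"Metadata_CPG": "cpg-example-b", "Metadata_Plate": "P1", "Metadata_Well": "A01"},
    {"Metadata_CPG": "cpg-example-b", "Metadata_Plate": "P1", "Metadata_Well": "A01"},
]
duplicates = duplicated_well_paths(wells)
```

In this toy data the same plate and well name under two different CPG ids is not a collision; only the repeated row under `cpg-example-b` is reported.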
To deal with the missing JCP ids:
[ ] ORF: Include related JUMP pilot in both PLATE and WELL tables.
[ ] COMPOUNDS: Drop the 957 compounds that were not actually used. @afermg will do this.
Decision time:
Do we use this opportunity to change the format of files?
Pros:
A parquet file, for instance, allows us to query columns independently of each other, removing the need to download the whole dataset.
A SQLite file is a single self-contained file that can be read by most database software.
Cons:
The main motivation for using csv.gz is to reduce friction for biologists who want to access the metadata. This can be alleviated by providing a WebAssembly system that fetches it in their browsers (akin to broad.io/babel or Ank's DuckDB system).
Please let me know if you have any opinions on this because, depending on our decision, I may need to write a script to convert csv.gz into a different format.
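For reference, a rough sketch of what such a conversion script could look like if we went with SQLite, using only the Python standard library. The function name and the choice to store every column as TEXT are assumptions made for illustration, not a decided design:

```python
import csv
import gzip
import sqlite3


def csv_gz_to_sqlite(csv_gz_path, db_path, table):
    """Load a gzipped CSV into a SQLite table, storing every column as TEXT."""
    with gzip.open(csv_gz_path, "rt", newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        # Quote identifiers so unusual column or table names do not break the SQL.
        cols = ", ".join('"{}" TEXT'.format(c.replace('"', '""')) for c in header)
        quoted_table = '"{}"'.format(table.replace('"', '""'))
        placeholders = ", ".join("?" for _ in header)
        con = sqlite3.connect(db_path)
        try:
            con.execute(f"DROP TABLE IF EXISTS {quoted_table}")
            con.execute(f"CREATE TABLE {quoted_table} ({cols})")
            # Stream the remaining rows straight from the CSV reader.
            con.executemany(
                f"INSERT INTO {quoted_table} VALUES ({placeholders})", reader
            )
            con.commit()
        finally:
            con.close()
```

It would be called as, for example, `csv_gz_to_sqlite("well.csv.gz", "metadata.sqlite", "well")` (file and table names hypothetical), after which a single column can be selected with plain SQL.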
I am closing this in favour of another internal issue due to the sensitive nature of some of the data. The previous issue (#104) will remain open until this is solved.