uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0

Refactor code for index directory #850

Closed zhuchcn closed 7 months ago

zhuchcn commented 7 months ago

Description

I don't like the way we currently handle the index directory and its files. Right now the versions of moPepGen, Python, and Biopython are stored in each of genome.pkl, annotation_gene.idx, annotation_tx.idx, and proteome.pkl. So here I created a metadata.json file to store all essential parameters, including versions, cleavage parameters (enzyme, miscleavages, etc.), and the genome/annotation source (GENCODE or ENSEMBL).
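To illustrate the idea, here is a minimal sketch of what writing such a metadata.json could look like. The field names (`version`, `cleavage_params`, `source`), the helper functions, and the example version strings are all assumptions made for illustration, not the actual schema used in this PR.

```python
import json
import platform
import tempfile
from pathlib import Path


def build_metadata(mopepgen_version, biopython_version, enzyme,
                   miscleavage, source):
    """Collect the index parameters into one JSON-serializable dict.

    All field names here are hypothetical, not moPepGen's actual schema.
    """
    return {
        "version": {
            "moPepGen": mopepgen_version,
            "python": platform.python_version(),
            "biopython": biopython_version,
        },
        "cleavage_params": {
            "enzyme": enzyme,
            "miscleavage": miscleavage,
        },
        "source": source,  # e.g. "GENCODE" or "ENSEMBL"
    }


def write_metadata(index_dir, metadata):
    """Write metadata.json into the index directory and return its path."""
    path = Path(index_dir) / "metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path


# Placeholder values; a temporary directory stands in for the index dir.
with tempfile.TemporaryDirectory() as tmp:
    meta = build_metadata("1.3.0", "1.81", "trypsin", 2, "GENCODE")
    written = write_metadata(tmp, meta)
    reloaded = json.loads(written.read_text())
```

Keeping these values in a single human-readable file means the versions no longer have to be duplicated inside every pickled index file.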

I also created an IndexDir class to manage the index directory. It is responsible for loading and dumping the different index files, which makes the code more organized.
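As a rough sketch of the pattern (not the actual implementation in this PR), a class like this could own every file path and all load/dump calls. The method names are hypothetical, and plain dicts stand in for the real genome and metadata objects.

```python
import json
import pickle
import tempfile
from pathlib import Path


class IndexDir:
    """Hypothetical sketch of a class that owns all index file I/O."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)
        # File names follow the ones mentioned in the description.
        self.genome_file = self.path / "genome.pkl"
        self.proteome_file = self.path / "proteome.pkl"
        self.metadata_file = self.path / "metadata.json"

    def _dump_pickle(self, obj, file):
        with open(file, "wb") as handle:
            pickle.dump(obj, handle)

    def _load_pickle(self, file):
        with open(file, "rb") as handle:
            return pickle.load(handle)

    def save_genome(self, genome):
        self._dump_pickle(genome, self.genome_file)

    def load_genome(self):
        return self._load_pickle(self.genome_file)

    def save_proteome(self, proteome):
        self._dump_pickle(proteome, self.proteome_file)

    def load_proteome(self):
        return self._load_pickle(self.proteome_file)

    def save_metadata(self, metadata):
        self.metadata_file.write_text(json.dumps(metadata, indent=2))

    def load_metadata(self):
        return json.loads(self.metadata_file.read_text())


# Plain dicts stand in for the real genome and metadata objects.
with tempfile.TemporaryDirectory() as tmp:
    index_dir = IndexDir(Path(tmp) / "index")
    index_dir.save_genome({"chr1": "ACGT"})
    index_dir.save_metadata({"source": "GENCODE"})
    genome = index_dir.load_genome()
    metadata = index_dir.load_metadata()
```

With this layout, callers never build file paths themselves, so renaming or adding an index file only touches one class.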

Closes #818

zhuchcn commented 7 months ago

The only place where the cleavage parameters are used is generating the canonical peptide pool. So technically we can let users generate different canonical peptide pools within the same index dir, because the annotation, genome, and proteome are all the same.
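One possible way to support this, sketched under the assumption that each canonical peptide pool gets its own file named after its cleavage parameters. The function name, file naming scheme, and enzyme labels are hypothetical.

```python
from pathlib import Path


def canonical_pool_path(index_dir, enzyme, miscleavage):
    """Return a file path unique to one set of cleavage parameters.

    The naming scheme is hypothetical, chosen only for illustration.
    """
    name = f"canonical_peptides_{enzyme}_{miscleavage}.pkl"
    return Path(index_dir) / name


# Two parameter sets map to two files inside the same index directory.
trypsin_pool = canonical_pool_path("index", "trypsin", 2)
lysc_pool = canonical_pool_path("index", "lysc", 2)
```

The genome, annotation, and proteome files would be shared, and only the peptide pool would be generated again for a new parameter set.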

Running a dataset as a sanity check is a good idea. I have several PRs on the way, so I'll just run the collaborator's small dataset (1 cell line) after all these PRs and before we release 1.3.

lydiayliu commented 7 months ago

> The only place where the cleavage parameters are used is generating the canonical peptide pool. So technically we can let users generate different canonical peptide pools within the same index dir, because the annotation, genome, and proteome are all the same.

So with the current approach, would the results be wrong if the cleavage parameters don't match between generateIndex and callVariant? If so, I think updating that would be a good idea. We could even allow a change of enzyme?

zhuchcn commented 7 months ago

With the current approach, callVariant will actually fail if the parameters don't match those in metadata.json. Yeah, we can allow changing the enzyme, too. I'll open an issue for this.
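A minimal sketch of that kind of check, assuming the metadata holds a `cleavage_params` mapping with `enzyme` and `miscleavage` keys. The function, exception class, and key names are hypothetical and do not reflect the actual callVariant code.

```python
class CleavageParamsMismatchError(ValueError):
    """Raised when requested cleavage parameters differ from the index."""


def check_cleavage_params(metadata, enzyme, miscleavage):
    """Fail early if the requested parameters differ from the stored ones."""
    stored = metadata["cleavage_params"]
    requested = {"enzyme": enzyme, "miscleavage": miscleavage}
    # Map each mismatched key to its (stored, requested) pair.
    mismatched = {
        key: (stored.get(key), value)
        for key, value in requested.items()
        if stored.get(key) != value
    }
    if mismatched:
        details = ", ".join(
            f"{key}: index={old!r}, requested={new!r}"
            for key, (old, new) in mismatched.items()
        )
        raise CleavageParamsMismatchError(
            f"Cleavage parameters do not match metadata.json ({details})"
        )


metadata = {"cleavage_params": {"enzyme": "trypsin", "miscleavage": 2}}
check_cleavage_params(metadata, "trypsin", 2)  # matches, so no error
```

Failing loudly here avoids silently comparing variant peptides against a canonical pool built with different cleavage settings.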