Refactor code for index directory

zhuchcn commented 7 months ago

Description

I don't like the way we currently handle the index directory and its files. Right now the versions of moPepGen, Python, and Biopython are stored inside genome.pkl, annotation_gene.idx, annotation_tx.idx, and proteome.pkl. This PR adds a metadata.json file that stores all essential parameters in one place: versions, cleavage parameters (enzyme, miscleavages, etc.), and the genome/annotation source (GENCODE or ENSEMBL).
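
For readers unfamiliar with the idea, here is a minimal sketch of what writing and reading such a metadata.json could look like. The key names (`version`, `cleavage_params`, `source`), the helper names, and the placeholder version strings are assumptions for illustration only; the actual layout in this PR may differ.

```python
import json
import platform
import tempfile
from pathlib import Path


def build_metadata(mopepgen_version, biopython_version, enzyme,
                   miscleavage, source):
    """Collect versions, cleavage parameters, and source in one dict.

    All key names here are hypothetical, not the PR's actual schema.
    """
    return {
        "version": {
            "python": platform.python_version(),
            "moPepGen": mopepgen_version,
            "biopython": biopython_version,
        },
        "cleavage_params": {
            "enzyme": enzyme,
            "miscleavage": miscleavage,
        },
        "source": source,  # "GENCODE" or "ENSEMBL"
    }


def write_metadata(index_dir, metadata):
    """Write the metadata dict to <index_dir>/metadata.json."""
    path = Path(index_dir) / "metadata.json"
    with open(path, "w") as handle:
        json.dump(metadata, handle, indent=2)
    return path


def read_metadata(index_dir):
    """Read <index_dir>/metadata.json back into a dict."""
    with open(Path(index_dir) / "metadata.json") as handle:
        return json.load(handle)
```

Because everything lives in one plain-text file, the parameters can be inspected without unpickling any of the larger index files.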

I also created an IndexDir class to handle the index directory. It is responsible for loading and dumping the different index files, which makes the code more organized.
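
As a rough illustration of the pattern (not the class added in this PR), a directory wrapper of this kind might look like the sketch below. The `FILES` mapping, the `dump`/`load` method names, and the choice of pickle versus JSON by file suffix are all assumptions made for the example.

```python
import json
import pickle
import tempfile
from pathlib import Path


class IndexDir:
    """Hypothetical sketch of a class that owns all index file paths."""

    # Logical name -> file name inside the index directory (assumed subset).
    FILES = {
        "genome": "genome.pkl",
        "proteome": "proteome.pkl",
        "metadata": "metadata.json",
    }

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)

    def _file(self, name):
        """Resolve a logical name to its path; KeyError if unknown."""
        return self.path / self.FILES[name]

    def dump(self, name, obj):
        """Serialize obj as JSON or pickle, depending on the file suffix."""
        target = self._file(name)
        if target.suffix == ".json":
            with open(target, "w") as handle:
                json.dump(obj, handle, indent=2)
        else:
            with open(target, "wb") as handle:
                pickle.dump(obj, handle)

    def load(self, name):
        """Load the object previously written with dump()."""
        target = self._file(name)
        if target.suffix == ".json":
            with open(target) as handle:
                return json.load(handle)
        with open(target, "rb") as handle:
            return pickle.load(handle)
```

Callers then ask for `index_dir.load("genome")` instead of building file paths themselves, so file names and formats are defined in a single place.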

Closes #818

Checklist

[X] This PR does NOT contain PHI or germline genetic data. A repo may need to be deleted if such data is uploaded. Disclosing PHI is a major problem.
[X] This PR does NOT contain molecular files, compressed files, output files such as images (e.g. .png, .jpeg), .pdf, .RData, .xlsx, .doc, .ppt, or other non-plain-text files. To automatically exclude such files using a .gitignore file, see here for example.
[X] I have read the code review guidelines and the code review best practice on GitHub check-list.
[X] The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
[X] I have added the major changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
[X] All test cases passed locally.

zhuchcn commented 7 months ago

The only place where the cleavage parameters are used is generating the canonical peptide pool. So technically we can let users generate different canonical peptide pools within the same index dir, because the annotation, genome, and proteome are all the same.
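
One way this could work, sketched under assumptions: derive the peptide pool's file name from the cleavage parameters, so that pools built with different settings sit side by side in the same index dir. The function name and the file naming scheme below are hypothetical, not something moPepGen currently does.

```python
from pathlib import Path


def canonical_pool_path(index_dir, enzyme, miscleavage):
    """Return a pool path that is unique per cleavage parameter set.

    The naming scheme is an assumption for illustration only.
    """
    name = f"canonical_peptides_{enzyme}_{miscleavage}.pkl"
    return Path(index_dir) / name
```

The genome, annotation, and proteome files would stay shared; only the pool file would vary with the parameters.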

Running a dataset as a sanity check is a good idea. I have several PRs on the way, so I'll run the collaborator's small dataset (1 cell line) after all these PRs and before we release 1.3.

lydiayliu commented 7 months ago

> The only place where the cleavage parameters are used is generating the canonical peptide pool. So technically we can let users generate different canonical peptide pools within the same index dir, because the annotation, genome, and proteome are all the same.

So with the current approach, would the results be wrong if the cleavage parameters differ between generateIndex and callVariant? If so, I think updating that would be a good idea. Could we even allow a change of enzyme?

zhuchcn commented 7 months ago

With the current approach, callVariant will actually fail if the parameters don't match those in metadata.json. Yeah, we can allow changing the enzyme, too. I'll open an issue for this.
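
To make the failure mode concrete, a check of this kind could be sketched as follows. This is an illustration under assumptions, not the code in callVariant: the function name, the parameter keys, and the use of ValueError are all hypothetical.

```python
def check_cleavage_params(stored, requested):
    """Raise ValueError if any requested parameter differs from the stored one.

    stored:    cleavage parameters read from metadata.json (assumed a dict)
    requested: cleavage parameters passed on the command line
    """
    mismatched = {
        key: (stored.get(key), value)
        for key, value in requested.items()
        if stored.get(key) != value
    }
    if mismatched:
        details = ", ".join(
            f"{key}: index has {old!r}, got {new!r}"
            for key, (old, new) in sorted(mismatched.items())
        )
        raise ValueError(
            f"Cleavage parameters do not match metadata.json ({details})"
        )
```

Failing early like this avoids silently comparing variant peptides against a canonical pool built with different settings.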

uclahs-cds / package-moPepGen

Refactor code for index directory #850
