Fix splitFasta output dir

zhuchcn commented 1 year ago

The splitFasta output dir is changed to splitFiltered if 'filterFasta' is used; otherwise it is still 'filter'.
Added summarizeFasta right after mergeFasta.

Example output below with merge_variant_noncoding set to 'both'.

test/output/test-integration-merge/call-NonCanonicalPeptide-1.0.0/UCLA0001/moPepGen-0.11.3/output/
├── decoy
│   ├── UCLA0001_merged_peptides_filtered_encode_decoy.fasta
│   ├── UCLA0001_merged_peptides_filtered_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_circRNA_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_circRNA_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_Fusion_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_Fusion_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_Fusion-Noncoding_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_Fusion-Noncoding_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_gINDEL-circRNA_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_gINDEL-circRNA_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_gINDEL_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_gINDEL_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_gSNP_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_gSNP_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_Noncoding_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_Noncoding_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_RNAEditingSite-circRNA_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_RNAEditingSite-circRNA_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_RNAEditingSite_encode_decoy.fasta
│   └── UCLA0001_splitFiltered_RNAEditingSite_encode_decoy.fasta.dict
├── encode
│   ├── UCLA0001_merged_peptides_filtered_encode.fasta
│   ├── UCLA0001_merged_peptides_filtered_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_circRNA_encode.fasta
│   ├── UCLA0001_splitFiltered_circRNA_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_Fusion_encode.fasta
│   ├── UCLA0001_splitFiltered_Fusion_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_Fusion-Noncoding_encode.fasta
│   ├── UCLA0001_splitFiltered_Fusion-Noncoding_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_gINDEL-circRNA_encode.fasta
│   ├── UCLA0001_splitFiltered_gINDEL-circRNA_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_gINDEL_encode.fasta
│   ├── UCLA0001_splitFiltered_gINDEL_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_gSNP_encode.fasta
│   ├── UCLA0001_splitFiltered_gSNP_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_Noncoding_encode.fasta
│   ├── UCLA0001_splitFiltered_Noncoding_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_RNAEditingSite-circRNA_encode.fasta
│   ├── UCLA0001_splitFiltered_RNAEditingSite-circRNA_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_RNAEditingSite_encode.fasta
│   └── UCLA0001_splitFiltered_RNAEditingSite_encode.fasta.dict
├── splitFiltered
│   ├── UCLA0001_splitFiltered_circRNA.fasta
│   ├── UCLA0001_splitFiltered_Fusion.fasta
│   ├── UCLA0001_splitFiltered_Fusion-Noncoding.fasta
│   ├── UCLA0001_splitFiltered_gINDEL-circRNA.fasta
│   ├── UCLA0001_splitFiltered_gINDEL.fasta
│   ├── UCLA0001_splitFiltered_gSNP.fasta
│   ├── UCLA0001_splitFiltered_Noncoding.fasta
│   ├── UCLA0001_splitFiltered_RNAEditingSite-circRNA.fasta
│   └── UCLA0001_splitFiltered_RNAEditingSite.fasta
├── UCLA0001_merged_peptides.fasta
├── UCLA0001_merged_peptides_filtered.fasta
├── UCLA0001_merged_peptides_filtered_summary.txt
├── UCLA0001_merged_peptides_summary.txt
├── UCLA0001_noncoding_peptides_filtered.fasta
├── UCLA0001_variant_peptides.fasta
├── UCLA0001_variant_peptides_filtered.fasta
├── UCLA0001_variant_peptides_filtered_summary.txt
└── UCLA0001_variant_peptides_summary.txt

Closes #80

[X] I have read the code review guidelines and the code review best practice on GitHub check-list.
[X] The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].
[X] I have set up or verified the branch protection rule following the github standards before opening this pull request.
[X] I have added my name to the contributors listings in the metadata.yaml and the manifest block in the nextflow.config as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)
[ ] I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.
[ ] I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)
[X] All test cases have passed.

lydiayliu commented 1 year ago

Why splitFiltered but not split_filtered to match the rest of the formats?

Can you also try with a sample where split_fasta = true and filter_fasta = true but no expression table is provided? There should be nothing output in split_filtered. I also don't think anything should be output in split?

zhuchcn commented 1 year ago

Why splitFiltered but not split_filtered to match the rest of the formats?

Done! Here is what it looks like right now with merge_variant_noncoding = both, split_fasta = true, and filter_fasta = true:

test/output/test-integration-merge/call-NonCanonicalPeptide-1.0.0/UCLA0001/moPepGen-0.11.3/output/
├── decoy
│   ├── UCLA0001_merged_peptides_filtered_encode_decoy.fasta
│   ├── UCLA0001_merged_peptides_filtered_encode_decoy.fasta.dict
│   ├── UCLA0001_split_circRNA_encode_decoy.fasta
│   ├── UCLA0001_split_circRNA_encode_decoy.fasta.dict
│   ├── UCLA0001_split_circRNA-RNAEditingSite_encode_decoy.fasta
│   ├── UCLA0001_split_circRNA-RNAEditingSite_encode_decoy.fasta.dict
│   ├── UCLA0001_split_Fusion_encode_decoy.fasta
│   ├── UCLA0001_split_Fusion_encode_decoy.fasta.dict
│   ├── UCLA0001_split_Fusion-Noncoding_encode_decoy.fasta
│   ├── UCLA0001_split_Fusion-Noncoding_encode_decoy.fasta.dict
│   ├── UCLA0001_split_gINDEL-circRNA_encode_decoy.fasta
│   ├── UCLA0001_split_gINDEL-circRNA_encode_decoy.fasta.dict
│   ├── UCLA0001_split_gINDEL_encode_decoy.fasta
│   ├── UCLA0001_split_gINDEL_encode_decoy.fasta.dict
│   ├── UCLA0001_split_gSNP_encode_decoy.fasta
│   ├── UCLA0001_split_gSNP_encode_decoy.fasta.dict
│   ├── UCLA0001_split_Noncoding_encode_decoy.fasta
│   ├── UCLA0001_split_Noncoding_encode_decoy.fasta.dict
│   ├── UCLA0001_split_RNAEditingSite_encode_decoy.fasta
│   └── UCLA0001_split_RNAEditingSite_encode_decoy.fasta.dict
├── encode
│   ├── UCLA0001_merged_peptides_filtered_encode.fasta
│   ├── UCLA0001_merged_peptides_filtered_encode.fasta.dict
│   ├── UCLA0001_split_circRNA_encode.fasta
│   ├── UCLA0001_split_circRNA_encode.fasta.dict
│   ├── UCLA0001_split_circRNA-RNAEditingSite_encode.fasta
│   ├── UCLA0001_split_circRNA-RNAEditingSite_encode.fasta.dict
│   ├── UCLA0001_split_Fusion_encode.fasta
│   ├── UCLA0001_split_Fusion_encode.fasta.dict
│   ├── UCLA0001_split_Fusion-Noncoding_encode.fasta
│   ├── UCLA0001_split_Fusion-Noncoding_encode.fasta.dict
│   ├── UCLA0001_split_gINDEL-circRNA_encode.fasta
│   ├── UCLA0001_split_gINDEL-circRNA_encode.fasta.dict
│   ├── UCLA0001_split_gINDEL_encode.fasta
│   ├── UCLA0001_split_gINDEL_encode.fasta.dict
│   ├── UCLA0001_split_gSNP_encode.fasta
│   ├── UCLA0001_split_gSNP_encode.fasta.dict
│   ├── UCLA0001_split_Noncoding_encode.fasta
│   ├── UCLA0001_split_Noncoding_encode.fasta.dict
│   ├── UCLA0001_split_RNAEditingSite_encode.fasta
│   └── UCLA0001_split_RNAEditingSite_encode.fasta.dict
├── split_filtered
│   ├── UCLA0001_split_circRNA.fasta
│   ├── UCLA0001_split_circRNA-RNAEditingSite.fasta
│   ├── UCLA0001_split_Fusion.fasta
│   ├── UCLA0001_split_Fusion-Noncoding.fasta
│   ├── UCLA0001_split_gINDEL-circRNA.fasta
│   ├── UCLA0001_split_gINDEL.fasta
│   ├── UCLA0001_split_gSNP.fasta
│   ├── UCLA0001_split_Noncoding.fasta
│   └── UCLA0001_split_RNAEditingSite.fasta
├── UCLA0001_merged_peptides.fasta
├── UCLA0001_merged_peptides_filtered.fasta
├── UCLA0001_merged_peptides_filtered_summary.txt
├── UCLA0001_merged_peptides_summary.txt
├── UCLA0001_noncoding_peptides_filtered.fasta
├── UCLA0001_variant_peptides.fasta
├── UCLA0001_variant_peptides_filtered.fasta
├── UCLA0001_variant_peptides_filtered_summary.txt
└── UCLA0001_variant_peptides_summary.txt
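
The naming pattern in the listing above can be summarised in a few lines. This is only a sketch inferred from the file names shown: the helper name `split_output_paths` is hypothetical and is not part of the pipeline, and it assumes every split fasta gets an encode and a decoy counterpart, each with a `.dict` file, as in this example run.

```python
from typing import List


def split_output_paths(sample: str, sources: List[str]) -> List[str]:
    """Relative paths for the split fasta files, following the listing above.

    Hypothetical helper for illustration only; the pipeline itself may
    build these names differently.
    """
    paths = []
    for source in sources:
        stem = f"{sample}_split_{source}"
        # The plain split fasta sits in split_filtered/.
        paths.append(f"split_filtered/{stem}.fasta")
        # The encode and decoy copies each come with a .dict file.
        for directory, suffix in (("encode", "_encode"), ("decoy", "_encode_decoy")):
            paths.append(f"{directory}/{stem}{suffix}.fasta")
            paths.append(f"{directory}/{stem}{suffix}.fasta.dict")
    return paths


for path in split_output_paths("UCLA0001", ["gSNP", "Fusion-Noncoding"]):
    print(path)
```

Comparing a list like this against the actual output directory is one way to check that no file is missing after a test run.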

zhuchcn commented 1 year ago

As discussed, we want:

For the gvf entrypoint, output the original fasta and the decoy / encode fasta of both the filtered and unfiltered fasta, for split and merge
For fasta entrypoint, we only want the filtered fasta
Modify the metapipeline so it is able to skip samples that don't have exprs_table with fasta entrypoint.

lydiayliu commented 1 year ago

Output plain fasta and decoy / encode fasta of both filtered and unfiltered fasta for split and merge for gvf entrypoint (if filter_fasta = TRUE); the decoy / encode directories have split and split_filtered subdirectories
For fasta entrypoint, we only want the filtered fasta (if filter_fasta = TRUE)
Modify the metapipeline so it is able to skip samples that don't have exprs_table with fasta entrypoint (if filter_fasta = TRUE)
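
The three rules above could be written down roughly as follows. This is a sketch, not the pipeline's Nextflow logic: the function name `fasta_variants_to_output` is hypothetical, and two cases the thread does not spell out are filled with labelled assumptions (what happens when filter_fasta is FALSE, and whether the gvf entrypoint checks for an expression table).

```python
from typing import List, Optional


def fasta_variants_to_output(
    entrypoint: str, filter_fasta: bool, has_exprs_table: bool
) -> Optional[List[str]]:
    """Return which fasta variants to publish, or None to skip the sample.

    Hypothetical illustration of the rules discussed in this thread.
    """
    if not filter_fasta:
        # Assumption: without filtering only the unfiltered fasta exists.
        return ["unfiltered"]
    if entrypoint == "fasta":
        if not has_exprs_table:
            # The sample cannot be filtered, so the metapipeline skips it.
            return None
        return ["filtered"]
    if entrypoint == "gvf":
        # Assumption: the expression table is not checked here.
        return ["unfiltered", "filtered"]
    raise ValueError(f"unsupported entrypoint: {entrypoint}")


print(fasta_variants_to_output("gvf", True, True))
print(fasta_variants_to_output("fasta", True, False))
```

Returning None for the skipped sample keeps "skip" distinct from "publish nothing", which is the distinction the metapipeline needs.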

zhuchcn commented 1 year ago

We want to call encodeFasta and decoyFasta on both the unfiltered and filtered fasta. We currently allow users to turn off the encode/decoy functions. Should we get rid of params.encode_fasta and params.decoy_fasta, so they will always be called? This would just make the logic a little simpler.

lydiayliu commented 1 year ago

No, for CCLE I specifically turn off encode_fasta and decoy_fasta because I literally don't need all the encode and decoy fastas floating around... they can't be used as input to merge

zhuchcn commented 1 year ago

I think I implemented it the way you want. There are too many output files, so the complete tree output probably won't fit here. I put the directory structure for the fasta, gvf and parser entrypoints in the file below on the cluster so you can take a look.

/hot/user/czhu/pipeline-call-NoncanonicalPeptide/tree_output.txt

I also changed the parameter from merge_variant_noncoding to database_processing_modes which I think is more reasonable.

lydiayliu commented 1 year ago

We are almost there!! Commenting on the output file

For fasta entry point or process_unfiltered_fasta = FALSE, we also don't need to output the unfiltered merged.fasta

├── UCLA0001_merged_peptides.fasta
├── UCLA0001_merged_peptides_filtered.fasta
├── UCLA0001_merged_peptides_filtered_summary.txt
├── UCLA0001_noncoding_peptides_filtered.fasta
├── UCLA0001_variant_peptides_filtered.fasta
├── UCLA0001_variant_peptides_filtered_summary.txt
└── variant_summary.txt

Although I suspect that it is easier to output it than not to output it... It doesn't hurt, but it is a bit of a waste as a duplicated file.

Otherwise, noncoding_peptides_filtered.fasta would now only be output as part of split, right?
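
The condition for the unfiltered merged fasta mentioned at the top of this comment amounts to a single boolean check. A minimal sketch, assuming the entrypoint is passed as a plain string; the function name `emit_unfiltered_merged` is hypothetical, and process_unfiltered_fasta is the parameter named in the comment.

```python
def emit_unfiltered_merged(entrypoint: str, process_unfiltered_fasta: bool) -> bool:
    """Decide whether the unfiltered merged fasta should be published.

    Hypothetical illustration: the file is dropped for the fasta
    entrypoint, or whenever process_unfiltered_fasta is FALSE.
    """
    return entrypoint != "fasta" and process_unfiltered_fasta


print(emit_unfiltered_merged("gvf", True))
print(emit_unfiltered_merged("fasta", True))
```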

zhuchcn commented 1 year ago

In the last commit, I added a tag of 'variant_only' to the summarizeFasta output from the 'plain' workflow. Also updated the 'tree_output.txt'. Let me know what you think!

lydiayliu commented 1 year ago

It is a bit long-winded but I think it works.

├── UCLA0001_variant_peptides.fasta
├── UCLA0001_variant_peptides_filtered.fasta
├── UCLA0001_variant_peptides_filtered_summary.txt
├── UCLA0001_variant_peptides_filtered_variant_only_summary.txt
└── UCLA0001_variant_peptides_summary.txt

I don't like how the unfiltered summary is just called _variant_peptides_summary.txt but it matches with the fasta. Let's just keep it like this!
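
For reference, the summary names in the listing above follow one pattern, with the summary sharing its prefix with the fasta it describes. The sketch below only reproduces that pattern; `summary_file_name` is a hypothetical helper, not a function in the pipeline.

```python
def summary_file_name(
    sample: str, database: str, filtered: bool = False, variant_only: bool = False
) -> str:
    """Build a summary file name matching the listing above.

    Hypothetical helper; `database` is e.g. "variant_peptides" or
    "merged_peptides", so the prefix matches the corresponding fasta.
    """
    parts = [sample, database]
    if filtered:
        parts.append("filtered")
    if variant_only:
        parts.append("variant_only")
    parts.append("summary")
    return "_".join(parts) + ".txt"


print(summary_file_name("UCLA0001", "variant_peptides"))
print(summary_file_name("UCLA0001", "variant_peptides", filtered=True, variant_only=True))
```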

zhuchcn commented 1 year ago

I don't like how the unfiltered summary is just called _variant_peptides_summary.txt but it matches with the fasta. Let's just keep it like this!

That's what I was thinking, too, so they can match up.

uclahs-cds / pipeline-call-NonCanonicalPeptide

Fix splitFasta output dir #81