uclahs-cds / pipeline-call-NonCanonicalPeptide

Nextflow pipeline to call non-canonical peptides as custom databases for proteogenomic analysis
https://automatic-adventure-o4l96o9.pages.github.io/
GNU General Public License v2.0

Fix splitFasta output dir #81

Closed zhuchcn closed 1 year ago

zhuchcn commented 1 year ago
  1. splitFasta output dir is changed to splitFiltered if 'filterFasta' is used, otherwise still 'filter'.
  2. Added summarizeFasta right after mergeFasta.

Example output below with merge_variant_noncoding set to 'both'.

test/output/test-integration-merge/call-NonCanonicalPeptide-1.0.0/UCLA0001/moPepGen-0.11.3/output/
├── decoy
│   ├── UCLA0001_merged_peptides_filtered_encode_decoy.fasta
│   ├── UCLA0001_merged_peptides_filtered_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_circRNA_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_circRNA_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_Fusion_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_Fusion_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_Fusion-Noncoding_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_Fusion-Noncoding_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_gINDEL-circRNA_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_gINDEL-circRNA_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_gINDEL_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_gINDEL_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_gSNP_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_gSNP_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_Noncoding_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_Noncoding_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_RNAEditingSite-circRNA_encode_decoy.fasta
│   ├── UCLA0001_splitFiltered_RNAEditingSite-circRNA_encode_decoy.fasta.dict
│   ├── UCLA0001_splitFiltered_RNAEditingSite_encode_decoy.fasta
│   └── UCLA0001_splitFiltered_RNAEditingSite_encode_decoy.fasta.dict
├── encode
│   ├── UCLA0001_merged_peptides_filtered_encode.fasta
│   ├── UCLA0001_merged_peptides_filtered_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_circRNA_encode.fasta
│   ├── UCLA0001_splitFiltered_circRNA_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_Fusion_encode.fasta
│   ├── UCLA0001_splitFiltered_Fusion_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_Fusion-Noncoding_encode.fasta
│   ├── UCLA0001_splitFiltered_Fusion-Noncoding_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_gINDEL-circRNA_encode.fasta
│   ├── UCLA0001_splitFiltered_gINDEL-circRNA_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_gINDEL_encode.fasta
│   ├── UCLA0001_splitFiltered_gINDEL_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_gSNP_encode.fasta
│   ├── UCLA0001_splitFiltered_gSNP_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_Noncoding_encode.fasta
│   ├── UCLA0001_splitFiltered_Noncoding_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_RNAEditingSite-circRNA_encode.fasta
│   ├── UCLA0001_splitFiltered_RNAEditingSite-circRNA_encode.fasta.dict
│   ├── UCLA0001_splitFiltered_RNAEditingSite_encode.fasta
│   └── UCLA0001_splitFiltered_RNAEditingSite_encode.fasta.dict
├── splitFiltered
│   ├── UCLA0001_splitFiltered_circRNA.fasta
│   ├── UCLA0001_splitFiltered_Fusion.fasta
│   ├── UCLA0001_splitFiltered_Fusion-Noncoding.fasta
│   ├── UCLA0001_splitFiltered_gINDEL-circRNA.fasta
│   ├── UCLA0001_splitFiltered_gINDEL.fasta
│   ├── UCLA0001_splitFiltered_gSNP.fasta
│   ├── UCLA0001_splitFiltered_Noncoding.fasta
│   ├── UCLA0001_splitFiltered_RNAEditingSite-circRNA.fasta
│   └── UCLA0001_splitFiltered_RNAEditingSite.fasta
├── UCLA0001_merged_peptides.fasta
├── UCLA0001_merged_peptides_filtered.fasta
├── UCLA0001_merged_peptides_filtered_summary.txt
├── UCLA0001_merged_peptides_summary.txt
├── UCLA0001_noncoding_peptides_filtered.fasta
├── UCLA0001_variant_peptides.fasta
├── UCLA0001_variant_peptides_filtered.fasta
├── UCLA0001_variant_peptides_filtered_summary.txt
└── UCLA0001_variant_peptides_summary.txt

Closes #80

Closes #...

lydiayliu commented 1 year ago

Why splitFiltered but not split_filtered to match the rest of the formats?

Can you also try with a sample where split_fasta = true and filter_fasta = true but no expression table is provided? There should be nothing output in split_filtered. I also don't think there should be anything output in split?

zhuchcn commented 1 year ago

Why splitFiltered but not split_filtered to match the rest of the formats?

Done! Here is what it looks like right now with merge_variant_noncoding = both, split_fasta = true, and filter_fasta = true:

test/output/test-integration-merge/call-NonCanonicalPeptide-1.0.0/UCLA0001/moPepGen-0.11.3/output/
├── decoy
│   ├── UCLA0001_merged_peptides_filtered_encode_decoy.fasta
│   ├── UCLA0001_merged_peptides_filtered_encode_decoy.fasta.dict
│   ├── UCLA0001_split_circRNA_encode_decoy.fasta
│   ├── UCLA0001_split_circRNA_encode_decoy.fasta.dict
│   ├── UCLA0001_split_circRNA-RNAEditingSite_encode_decoy.fasta
│   ├── UCLA0001_split_circRNA-RNAEditingSite_encode_decoy.fasta.dict
│   ├── UCLA0001_split_Fusion_encode_decoy.fasta
│   ├── UCLA0001_split_Fusion_encode_decoy.fasta.dict
│   ├── UCLA0001_split_Fusion-Noncoding_encode_decoy.fasta
│   ├── UCLA0001_split_Fusion-Noncoding_encode_decoy.fasta.dict
│   ├── UCLA0001_split_gINDEL-circRNA_encode_decoy.fasta
│   ├── UCLA0001_split_gINDEL-circRNA_encode_decoy.fasta.dict
│   ├── UCLA0001_split_gINDEL_encode_decoy.fasta
│   ├── UCLA0001_split_gINDEL_encode_decoy.fasta.dict
│   ├── UCLA0001_split_gSNP_encode_decoy.fasta
│   ├── UCLA0001_split_gSNP_encode_decoy.fasta.dict
│   ├── UCLA0001_split_Noncoding_encode_decoy.fasta
│   ├── UCLA0001_split_Noncoding_encode_decoy.fasta.dict
│   ├── UCLA0001_split_RNAEditingSite_encode_decoy.fasta
│   └── UCLA0001_split_RNAEditingSite_encode_decoy.fasta.dict
├── encode
│   ├── UCLA0001_merged_peptides_filtered_encode.fasta
│   ├── UCLA0001_merged_peptides_filtered_encode.fasta.dict
│   ├── UCLA0001_split_circRNA_encode.fasta
│   ├── UCLA0001_split_circRNA_encode.fasta.dict
│   ├── UCLA0001_split_circRNA-RNAEditingSite_encode.fasta
│   ├── UCLA0001_split_circRNA-RNAEditingSite_encode.fasta.dict
│   ├── UCLA0001_split_Fusion_encode.fasta
│   ├── UCLA0001_split_Fusion_encode.fasta.dict
│   ├── UCLA0001_split_Fusion-Noncoding_encode.fasta
│   ├── UCLA0001_split_Fusion-Noncoding_encode.fasta.dict
│   ├── UCLA0001_split_gINDEL-circRNA_encode.fasta
│   ├── UCLA0001_split_gINDEL-circRNA_encode.fasta.dict
│   ├── UCLA0001_split_gINDEL_encode.fasta
│   ├── UCLA0001_split_gINDEL_encode.fasta.dict
│   ├── UCLA0001_split_gSNP_encode.fasta
│   ├── UCLA0001_split_gSNP_encode.fasta.dict
│   ├── UCLA0001_split_Noncoding_encode.fasta
│   ├── UCLA0001_split_Noncoding_encode.fasta.dict
│   ├── UCLA0001_split_RNAEditingSite_encode.fasta
│   └── UCLA0001_split_RNAEditingSite_encode.fasta.dict
├── split_filtered
│   ├── UCLA0001_split_circRNA.fasta
│   ├── UCLA0001_split_circRNA-RNAEditingSite.fasta
│   ├── UCLA0001_split_Fusion.fasta
│   ├── UCLA0001_split_Fusion-Noncoding.fasta
│   ├── UCLA0001_split_gINDEL-circRNA.fasta
│   ├── UCLA0001_split_gINDEL.fasta
│   ├── UCLA0001_split_gSNP.fasta
│   ├── UCLA0001_split_Noncoding.fasta
│   └── UCLA0001_split_RNAEditingSite.fasta
├── UCLA0001_merged_peptides.fasta
├── UCLA0001_merged_peptides_filtered.fasta
├── UCLA0001_merged_peptides_filtered_summary.txt
├── UCLA0001_merged_peptides_summary.txt
├── UCLA0001_noncoding_peptides_filtered.fasta
├── UCLA0001_variant_peptides.fasta
├── UCLA0001_variant_peptides_filtered.fasta
├── UCLA0001_variant_peptides_filtered_summary.txt
└── UCLA0001_variant_peptides_summary.txt
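
The naming pattern in the listing above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: `split_output_paths` and its arguments are hypothetical names, and using a plain `split` directory when filtering is off is an assumption taken from the review comments, not from the listing.

```python
from pathlib import PurePosixPath


def split_output_paths(sample: str, source: str, filter_fasta: bool = True) -> dict:
    """Build relative output paths for one split fasta (hypothetical helper)."""
    # 'split_filtered' appears in the listing above; 'split' for the
    # unfiltered case is an assumption based on the review comments.
    split_dir = "split_filtered" if filter_fasta else "split"
    stem = f"{sample}_split_{source}"
    encode = PurePosixPath("encode") / f"{stem}_encode.fasta"
    decoy = PurePosixPath("decoy") / f"{stem}_encode_decoy.fasta"
    return {
        "plain": PurePosixPath(split_dir) / f"{stem}.fasta",
        "encode": encode,
        "encode_dict": PurePosixPath(str(encode) + ".dict"),
        "decoy": decoy,
        "decoy_dict": PurePosixPath(str(decoy) + ".dict"),
    }


if __name__ == "__main__":
    for kind, path in split_output_paths("UCLA0001", "gSNP").items():
        print(kind, path)
```

Only the directory name depends on `filter_fasta` in this sketch; the file stem stays `_split_`, as in the listing.
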
zhuchcn commented 1 year ago

As discussed, we want:

  1. For the gvf entrypoint, output the original fasta and the decoy/encode fasta of both the filtered and unfiltered fasta, for split and merge
  2. For the fasta entrypoint, we only want the filtered fasta
  3. Modify the metapipeline so it is able to skip samples that don't have exprs_table with fasta entrypoint.
lydiayliu commented 1 year ago
  1. Output plain fasta and decoy / encode fasta of both filtered and unfiltered fasta, for split and merge, for the gvf entrypoint (if filter_fasta = TRUE). The decoy / encode directories should have split and split_filtered subdirectories
  2. For fasta entrypoint, we only want the filtered fasta (if filter_fasta = TRUE)
  3. Modify the metapipeline so it is able to skip samples that don't have exprs_table with fasta entrypoint (if filter_fasta = TRUE)
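
The three rules above could be sketched as the decision logic below. This is only an illustration under stated assumptions: the function names are hypothetical, and returning just the unfiltered set when filter_fasta is off is an assumption, since the list only covers the filter_fasta = TRUE case.

```python
from typing import List


def planned_fasta_sets(entrypoint: str, filter_fasta: bool) -> List[str]:
    """Return which fasta sets would be emitted (hypothetical helper)."""
    if not filter_fasta:
        # Assumption: without filtering, only unfiltered fasta exist.
        return ["unfiltered"]
    if entrypoint == "fasta":
        # Rule 2: the fasta entrypoint only keeps the filtered fasta.
        return ["filtered"]
    # Rule 1: the gvf entrypoint keeps both unfiltered and filtered fasta.
    return ["unfiltered", "filtered"]


def should_skip_sample(entrypoint: str, filter_fasta: bool, has_exprs_table: bool) -> bool:
    """Rule 3: skip fasta-entrypoint samples that lack an exprs_table."""
    return entrypoint == "fasta" and filter_fasta and not has_exprs_table


if __name__ == "__main__":
    print(planned_fasta_sets("gvf", True))
    print(should_skip_sample("fasta", True, has_exprs_table=False))
```
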
zhuchcn commented 1 year ago

We want to call encodeFasta and decoyFasta on both the unfiltered and filtered fasta. We currently allow users to turn off the encode/decoy functions. Should we get rid of params.encode_fasta and params.decoy_fasta, so that they are always called? That would simplify the logic a bit.

lydiayliu commented 1 year ago

No, for CCLE I specifically turn off encode_fasta and decoy_fasta because I literally don't need all the encode and decoy fastas floating around... they can't be used as input to merge

zhuchcn commented 1 year ago

I think I implemented it the way you want. There are too many output files, so the complete tree output probably won't fit here. I put the directory structure for the fasta, gvf and parser entrypoints in the file below on the cluster so you can take a look.

/hot/user/czhu/pipeline-call-NoncanonicalPeptide/tree_output.txt

I also changed the parameter from merge_variant_noncoding to database_processing_modes which I think is more reasonable.

lydiayliu commented 1 year ago

We are almost there!! Commenting on the output file

For fasta entry point or process_unfiltered_fasta = FALSE, we also don't need to output the unfiltered merged.fasta

├── UCLA0001_merged_peptides.fasta
├── UCLA0001_merged_peptides_filtered.fasta
├── UCLA0001_merged_peptides_filtered_summary.txt
├── UCLA0001_noncoding_peptides_filtered.fasta
├── UCLA0001_variant_peptides_filtered.fasta
├── UCLA0001_variant_peptides_filtered_summary.txt
└── variant_summary.txt

Although I suspect that it is easier to output it than not to output it... It doesn't hurt, but it is a bit of a waste as a duplicated file.

Otherwise, noncoding_peptides_filtered.fasta would now only be output as part of split, right?

zhuchcn commented 1 year ago

In the last commit, I added a tag of 'variant_only' to the summarizeFasta output from the 'plain' workflow. Also updated the 'tree_output.txt'. Let me know what you think!

lydiayliu commented 1 year ago

It is a bit long-winded but I think it works.

├── UCLA0001_variant_peptides.fasta
├── UCLA0001_variant_peptides_filtered.fasta
├── UCLA0001_variant_peptides_filtered_summary.txt
├── UCLA0001_variant_peptides_filtered_variant_only_summary.txt
└── UCLA0001_variant_peptides_summary.txt

I don't like how the unfiltered summary is just called _variant_peptides_summary.txt but it matches with the fasta. Let's just keep it like this!
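
For reference, the summary file names in the listing above follow a pattern that can be sketched like this. It is a rough illustration only: `summary_name` and its parameters are hypothetical and are not taken from the pipeline code.

```python
from typing import Optional


def summary_name(sample: str, database: str = "variant",
                 filtered: bool = False, tag: Optional[str] = None) -> str:
    """Compose a summary file name from its parts (hypothetical helper)."""
    parts = [sample, f"{database}_peptides"]
    if filtered:
        parts.append("filtered")
    if tag:
        # e.g. the 'variant_only' tag added to the 'plain' workflow output
        parts.append(tag)
    parts.append("summary.txt")
    return "_".join(parts)


if __name__ == "__main__":
    print(summary_name("UCLA0001"))
    print(summary_name("UCLA0001", filtered=True, tag="variant_only"))
```

Because the unfiltered case simply omits the `filtered` part, the summary name lines up with the matching fasta name.
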

zhuchcn commented 1 year ago

I don't like how the unfiltered summary is just called _variant_peptides_summary.txt but it matches with the fasta. Let's just keep it like this!

That's what I was thinking, too, so they can match up.