microbiomedata / metaPro

Workflow for meta-proteomics analysis
BSD 2-Clause "Simplified" License

Writing Cromwell-WDL to execute commands for multiple containers. #9

Closed anubhav0fnu closed 3 years ago

anubhav0fnu commented 3 years ago

A metaP workflow written in WDL is needed.

ssarrafan commented 3 years ago

@anubhav0fnu can this issue be closed or would you like it to be moved to the August sprint?

ssarrafan commented 3 years ago

Moved to August sprint per Slack message from @anubhav0fnu

anubhav0fnu commented 3 years ago

@scanon , @hubin-keio , @Michal-Babins Following up on the Aug 23rd meeting.

Question: what are the inputs and outputs of each command in the [shell script](https://github.com/microbiomedata/metaPro/blob/master/run_tasks.sh)?

Answer:

Processing only for stegen/500088 (test dataset).


INPUT:

.
├── data
│ └── set_of_Dataset_IDs
│     └── stegen
│         └── 500088
│             └── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39.raw
├── fastas
│ └── stegen
│     └── 1781_100336
│         ├── Ga0482236_functional_annotation.gff
│         └── Ga0482236_proteins.faa
├── mappings
│   └── EMSL48473_JGI1781_Stegen_DatasetToMetagenomeMapping_2021-01-25.xlsx
└── parameters
    ├── LTQ-FT_10ppm_2014-08-06.xml
    ├── MSGFPlus_PartTryp_MetOx_20ppmParTol.txt
    ├── MSGFPlus_PartTryp_MetOx_20ppmParTol_ModDefs.txt
    ├── MSGFPlus_Tryp_NoMods_20ppmParTol.txt
    ├── Mass_Correction_Tags.txt
    └── Tryp_Pig_Bov.fasta

docker exec -it analysisJobContainer python3.8 ./metaPro/src/prepare_input/emsl_to_jgi.py

OUTPUT: emsl_to_jgi.json


INPUT: emsl_to_jgi.json

docker exec -it analysisJobContainer python3.8 ./metaPro/src/analysis_jobs/run_analysis_job.py

OUTPUT:

.
└── 1781_100336
    ├── analysis_jobs_logs
    │   ├── 0_masic.commandlog
    │   ├── 0_masic.log
    │   ├── 1_MSconvert.log
    │   ├── 2_MSGFPlus.log
    │   ├── 3_MzidToTsvConverter.log
    │   ├── 4_TsvToSynConverter.commandlog
    │   ├── 4_TsvToSynConverter.log
    │   └── ProteinDigestionSimulator.log
    ├── merged_jobs
    │   └── 500088_1781_100336_MSGFjobs_MASIC_resultant.tsv
    ├── msgfplus_input
    │   └── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39.mzML
    ├── msgfplus_output
    │   ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39.mzid
    │   ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39.tsv
    │   ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_PepToProtMapMTS.txt
    │   └── fasta_residuals
    │       ├── Ga0482236_proteins.faa
    │       ├── Ga0482236_proteins.revCat.canno
    │       ├── Ga0482236_proteins.revCat.cnlcp
    │       ├── Ga0482236_proteins.revCat.csarr
    │       ├── Ga0482236_proteins.revCat.cseq
    │       └── Ga0482236_proteins.revCat.fasta
    ├── nmdc_jobs
    │   ├── SIC
    │   │   ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_DatasetInfo.xml
    │   │   ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_MSMS_scans.csv
    │   │   ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_MS_scans.csv
    │   │   ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_SICs.xml
    │   │   ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_SICstats.txt
    │   │   ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_ScanStats.txt
    │   │   ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_ScanStatsConstant.txt
    │   │   ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_ScanStatsEx.txt
    │   │   └── index.html
    │   └── SYNOPSIS
    │       ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_fht.txt
    │       ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_syn.txt
    │       ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_syn_ModDetails.txt
    │       ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_syn_ModSummary.txt
    │       ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_syn_ProteinMods.txt
    │       ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_syn_ResultToSeqMap.txt
    │       ├── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_syn_SeqInfo.txt
    │       └── Froze_Core_2015_N2_50_60_34_QE_26May16_Pippin_16-03-39_syn_SeqToProteinMap.txt
    ├── protein_digestion
    │   └── Ga0482236_proteins.txt

and

modified emsl_to_jgi.json


INPUT: modified emsl_to_jgi.json

docker exec -it postProcessingContainer python ./metaPro/src/post_processing/run_fa.py

OUTPUT:

    └── reports
        ├── 500088_1781_100336_Peptide_Report.tsv
        ├── 500088_1781_100336_Protein_Report.tsv
        └── 500088_1781_100336_QC_metrics.tsv

and

modified emsl_to_jgi.json
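
As a quick sanity check after this step, the three report files listed above could be verified with a small helper. This is only a sketch: `check_reports` and its two arguments are hypothetical names, and the reports directory is passed in explicitly because the thread does not give its full path.

```shell
#!/usr/bin/env bash
# Sketch: confirm that the three report files from run_fa.py exist and are non-empty.
# check_reports and its arguments are illustrative names, not part of metaPro.
set -euo pipefail

check_reports() {
  # $1 = reports directory, $2 = file prefix such as 500088_1781_100336
  local dir="$1" prefix="$2" missing=0 suffix
  for suffix in Peptide_Report Protein_Report QC_metrics; do
    if [ ! -s "$dir/${prefix}_${suffix}.tsv" ]; then
      echo "missing or empty: $dir/${prefix}_${suffix}.tsv" >&2
      missing=1
    fi
  done
  # Returns 0 when all three files are present, 1 otherwise
  return "$missing"
}
```

For the test dataset this might be called as `check_reports path/to/reports 500088_1781_100336`, where `path/to/reports` stands for wherever the `reports` directory lands in your setup.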


INPUT: modified emsl_to_jgi.json

docker exec -it postProcessingContainer python ./metaPro/src/metadata_collection/gen_meta_data.py

OUTPUT:

├── stegen_MetaProteomicAnalysis_activity.json
└── stegen_emsl_analysis_data_objects.json

and

modified emsl_to_jgi.json

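The four commands above run strictly in sequence, each one consuming the `emsl_to_jgi.json` left by the previous step. A minimal sketch of chaining them in one script follows. The container names and script paths are copied from the commands above; `DRY_RUN`, `run_step`, and `run_pipeline` are hypothetical names introduced here, and the script defaults to printing the commands rather than executing them.

```shell
#!/usr/bin/env bash
# Sketch: run the four metaPro steps in the order listed in this comment.
# DRY_RUN, run_step and run_pipeline are illustrative names, not part of metaPro.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run_step() {
  # $1 = container name, remaining arguments = command to run inside it
  local container="$1"; shift
  if [ "$DRY_RUN" = "1" ]; then
    echo "docker exec $container $*"
  else
    # -it is dropped on purpose: unattended runs have no TTY to attach to
    docker exec "$container" "$@"
  fi
}

run_pipeline() {
  run_step analysisJobContainer python3.8 ./metaPro/src/prepare_input/emsl_to_jgi.py
  run_step analysisJobContainer python3.8 ./metaPro/src/analysis_jobs/run_analysis_job.py
  run_step postProcessingContainer python ./metaPro/src/post_processing/run_fa.py
  run_step postProcessingContainer python ./metaPro/src/metadata_collection/gen_meta_data.py
}

run_pipeline
```

Because of `set -e`, a failing step stops the chain before later steps read a stale `emsl_to_jgi.json`. Setting `DRY_RUN=0` executes the commands and assumes both containers are already running.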

FYI @scanon and @hubin-keio: @Michal-Babins ran the workflow on July 27th and has both the results and the data for that particular dataset.

Also FYI: @pdpiehowski, @SamuelPurvine.

hubin-keio commented 3 years ago

Hello, Anubhav,

Thanks for the update. The logic of each script is still hard to follow. Can you start with an expanded legend of the diagram illustrating the metaP workflow? For example, what are the outputs of MASIC and MSGF+? Among the output files, which one is used for peak area detection, and what is the result file of that step? Thanks.

Regards, Bin

anubhav0fnu commented 3 years ago

Hello @hubin-keio

I think your requests are out of the scope of work. If you need further assistance, please do contact task leads. The codebase is open-source & you're free to spend time with the codebase and learn parts of it by yourself. I can't dedicate my time to educate more about the workflow other than what's provided in the project directory.

Additionally, I created this issue, assigned it to myself, and am working on it. I'm not aware of any request for assistance from our team to you; if needed, we'll connect as per the guidelines defined under this collaborative project.


ssarrafan commented 3 years ago

Moving this out of this sprint per Slack message from Anubhav. It's now labeled 'backlog' and will be pulled into an appropriate sprint in the future.

ssarrafan commented 3 years ago

Moving to in progress per @anubhav0fnu, who said it would be closed soon.

anubhav0fnu commented 3 years ago

@ssarrafan, @scanon I rolled out the metaPro WDL.

ssarrafan commented 3 years ago

Thanks @anubhav0fnu. I will close this one, but if you need a new related issue for November, let me know.