nf-core / kmermaid

k-mer similarity analysis pipeline
https://nf-co.re/kmermaid
MIT License

[WIP] Adding 10x gz changes with python files moved to bam2fasta repo #63

Closed pranathivemuri closed 4 years ago

pranathivemuri commented 4 years ago

Many thanks for contributing to nf-core/kmer-similarity!

Please fill in the appropriate checklist below (delete whatever is not relevant). These are the most common things requested on pull requests (PRs).

PR checklist

Learn more about contributing: https://github.com/nf-core/kmer-similarity/tree/master/.github/CONTRIBUTING.md

pranathivemuri commented 4 years ago
  1. Change the docs to remove the BAM-related barcodes / barcodes_renamer file, update the tests and the README, and remove test_bam.config.
  2. Alternatively, consider keeping the above and offering both bam2fasta methods in kmermaid. Is it useful for kmermaid to have both? It could be added later if needed, since bam2fasta's count_umis_percell and make_fastqs_per_cell already solve that problem.
pranathivemuri commented 4 years ago

I ran the tests with test_bam.config and test_tenx_tgz.config locally. Both pass, but (1) with test_bam.config the second BAM is not processed, and (2) with test_tenx_tgz.config, fastp and the subsequent processes don't run at all.

(nextflow) ➜  kmermaid git:(pranathi-10x-gz) ✗ nextflow run main.nf -c conf/test_bam.config
N E X T F L O W  ~  version 19.07.0
Launching `main.nf` [festering_rubens] - revision: ec64a7b804
----------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/kmermaid v1.0.0dev
----------------------------------------------------
Run Name          : festering_rubens
BAM               : [https://github.com/nf-core/test-datasets/raw/kmermaid/testdata/10x-example/possorted_genome_bam.bam, https://github.com/nf-core/test-datasets/raw/olgabot/kmermaid-unaligned-tgz-v2/testdata/mouse_brown_fat_ptprc_plus_unaligned/outs/possorted_genome_bam.bam]
Skip trimming?    : false
K-mer sizes       : 3,9
Molecule          : dna,protein,dayhoff
Log2 Sketch Sizes : 2,4
One Sig per Record: true
Track Abundance   : false
Max Resources     : 6 GB memory, 2 cpus, 2d time per job
Output dir        : ./results
Launch dir        : /Users/pranathivemuri/czbiohub/kmermaid
Working dir       : /Users/pranathivemuri/czbiohub/kmermaid/work
Script dir        : /Users/pranathivemuri/czbiohub/kmermaid
User              : pranathivemuri
Config Profile    : standard
Config Description: Minimal test dataset to check pipeline function
----------------------------------------------------
executor >  local (11)
[ea/344c44] process > get_software_versions                                                                                                      [100%] 1 of 1 ✔
[55/d18a8d] process > bam2fasta (1)                                                                                                              [100%] 1 of 1 ✔
[e2/37b93b] process > fastp (possorted_genome_bam)                                                                                               [100%] 1 of 1 ✔
[04/b774ac] process > sourmash_compute_sketch_fastx_nucleotide (possorted_genome_bam_molecule-dna_ksize-9_log2sketchsize-2_trackabundance-false) [100%] 4 of 4 ✔
[dc/458fb0] process > sourmash_compare_sketches (molecule-dna_ksize-3_log2sketchsize-2_trackabundance-false)                                     [100%] 4 of 4 ✔
Warning, pipeline completed, but with errored process(es)
Number of ignored errored process(es) : 0
Number of successfully ran process(es) : 11
[nf-core/kmermaid] Pipeline completed successfully
WARN: Task runtime metrics are not reported when using macOS without a container engine

(nextflow) ➜  kmermaid git:(pranathi-10x-gz) ✗ nextflow run main.nf -c conf/test_tenx_tgz.config 
N E X T F L O W  ~  version 19.07.0
Launching `main.nf` [backstabbing_hilbert] - revision: ec64a7b804
----------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/kmermaid v1.0.0dev
----------------------------------------------------
Run Name          : backstabbing_hilbert
Skip trimming?    : false
K-mer sizes       : 3,9
Molecule          : dna,protein,dayhoff
Log2 Sketch Sizes : 2,4
One Sig per Record: true
Track Abundance   : false
10x .tgz          : [https://github.com/nf-core/test-datasets/raw/kmermaid/testdata/mouse_lung.tgz, https://github.com/nf-core/test-datasets/raw/olgabot/kmermaid-unaligned-tgz-v3/testdata/mouse_brown_fat_ptprc_plus_unaligned.tgz]
10x SAM tags      : CB,UB,XC,XM,RG
10x Cell pattern  : (CB|CB):Z:([ACGT]+)(\-1)?
10x UMI pattern   : (UB|XB):Z:([ACGT]+)
Min UMI/cell      : 10
Max Resources     : 6 GB memory, 2 cpus, 2d time per job
Output dir        : ./results
Launch dir        : /Users/pranathivemuri/czbiohub/kmermaid
Working dir       : /Users/pranathivemuri/czbiohub/kmermaid/work
Script dir        : /Users/pranathivemuri/czbiohub/kmermaid
User              : pranathivemuri
Config Profile    : standard
Config Description: Minimal test dataset to check pipeline function
----------------------------------------------------
executor >  local (12)
[c8/7e80f3] process > get_software_versions                                                   [100%] 1 of 1 ✔
[59/64d418] process > tenx_tgz_extract_bam (mouse_brown_fat_ptprc_plus_unaligned)             [100%] 2 of 2 ✔
[ed/2b5043] process > samtools_fastq_aligned (mouse_brown_fat_ptprc_plus_unaligned)           [100%] 2 of 2 ✔
[a7/2039d6] process > samtools_fastq_unaligned (mouse_brown_fat_ptprc_plus_unaligned)         [100%] 2 of 2 ✔
[89/82046a] process > count_umis_per_cell (mouse_brown_fat_ptprc_plus_unaligned__aligned)     [100%] 2 of 2 ✔
[d1/dbc2d3] process > extract_per_cell_fastqs (mouse_brown_fat_ptprc_plus_unaligned__aligned) [100%] 3 of 3 ✔
[-        ] process > fastp                                                                   -
[-        ] process > sourmash_compute_sketch_fastx_nucleotide                                -
[-        ] process > sourmash_compare_sketches                                               -
Warning, pipeline completed, but with errored process(es)
Number of ignored errored process(es) : 0
Number of successfully ran process(es) : 12
[nf-core/kmermaid] Pipeline completed successfully
WARN: Task runtime metrics are not reported when using macOS without a container engine
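
To make the "10x Cell pattern" and "10x UMI pattern" settings in the log above more concrete, here is a minimal Python sketch of how such regular expressions pull a cell barcode and a UMI out of SAM tags. The two patterns are copied from the log; the function name `extract_barcodes` and the example tag string are made up for illustration, and this is not bam2fasta's or the pipeline's actual code.

```python
import re

# Patterns copied verbatim from the "10x Cell pattern" / "10x UMI pattern" log lines.
CELL_PATTERN = re.compile(r"(CB|CB):Z:([ACGT]+)(\-1)?")
UMI_PATTERN = re.compile(r"(UB|XB):Z:([ACGT]+)")


def extract_barcodes(sam_tags):
    """Return (cell_barcode, umi) from a SAM tag string, or None if either tag is missing."""
    cell = CELL_PATTERN.search(sam_tags)
    umi = UMI_PATTERN.search(sam_tags)
    if cell is None or umi is None:
        return None
    # Group 2 holds the nucleotide sequence without the tag prefix or the "-1" suffix.
    return cell.group(2), umi.group(2)


# Hypothetical tag string, invented for this example only.
example_tags = "NH:i:1\tCB:Z:AAACCTGAGAAACCAT-1\tUB:Z:GGTTCCAAGT"
print(extract_barcodes(example_tags))
```

A read lacking either tag yields `None` here, which is one plausible way such reads could be skipped before per-cell UMI counting.
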
pranathivemuri commented 4 years ago
  1. For test_bam.config, I think the error could be because both files have the same name, possorted_genome_bam.bam.

  2. For fastp and the subsequent processes not being executed with test_tenx_tgz.config, I think it's because of these lines:

         // Make per-cell fastqs into a flat channel that matches the read channels of yore
         per_channel_cell_reads_ch
           .dump(tag: 'per_channel_cell_reads_ch')
           .flatten()
           .filter{ it -> it.size() > 0 }
           // each item is just a single file, no need to do it[1]
           .map{ it -> tuple(it.simpleName, file(it)) }
           .dump(tag: 'per_cell_fastqs_ch')
           .set{ per_cell_fastqs_ch }
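
The first hypothesis can be illustrated with a small Python sketch, assuming sample IDs are derived from the file name alone. The two URLs are the ones on the "BAM" line of the test_bam.config run; `id_from_basename` and `id_from_parent_dirs` are hypothetical helpers, not functions from the pipeline, and whether this collision is the real cause remains unconfirmed.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# The two BAM URLs listed on the "BAM" line of the test_bam.config run.
BAM_URLS = [
    "https://github.com/nf-core/test-datasets/raw/kmermaid/testdata/10x-example/possorted_genome_bam.bam",
    "https://github.com/nf-core/test-datasets/raw/olgabot/kmermaid-unaligned-tgz-v2/testdata/mouse_brown_fat_ptprc_plus_unaligned/outs/possorted_genome_bam.bam",
]


def id_from_basename(url):
    """Sample ID taken from the file name alone, with the extension dropped."""
    return PurePosixPath(urlparse(url).path).stem


def id_from_parent_dirs(url, levels=2):
    """Hypothetical alternative: prefix the file stem with its enclosing directory names."""
    path = PurePosixPath(urlparse(url).path)
    parents = path.parts[-(levels + 1):-1]
    return "__".join(parents + (path.stem,))


by_basename = {id_from_basename(u) for u in BAM_URLS}
by_parents = {id_from_parent_dirs(u) for u in BAM_URLS}
print(len(by_basename), len(by_parents))  # 1 2
```

Both URLs collapse to the single ID `possorted_genome_bam` when only the basename is used, so anything keyed on that ID would see one sample instead of two.
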

olgabot commented 4 years ago
> 1. For test_bam.config, I think the error could be because both files have the same name, possorted_genome_bam.bam.

Can there be a test added with different filenames?

> 2. For fastp and the subsequent processes not being executed with test_tenx_tgz.config, I think it's because of these lines:
>
>        // Make per-cell fastqs into a flat channel that matches the read channels of yore
>        per_channel_cell_reads_ch
>          .dump(tag: 'per_channel_cell_reads_ch')
>          .flatten()
>          .filter{ it -> it.size() > 0 }
>          // each item is just a single file, no need to do it[1]
>          .map{ it -> tuple(it.simpleName, file(it)) }
>          .dump(tag: 'per_cell_fastqs_ch')
>          .set{ per_cell_fastqs_ch }

Does that mean that all of the fastqs are empty?
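
One way to reason about that question is to mimic the flatten / filter / map chain outside Nextflow. The Python sketch below is only an approximation: `simple_name` imitates Nextflow's `simpleName` (file name up to the first dot), the flattening handles a single level of nesting, and the file names are invented for the example.

```python
import tempfile
from pathlib import Path


def simple_name(path):
    """Approximate Nextflow's `simpleName`: the file name up to its first dot."""
    return path.name.split(".")[0]


def per_cell_fastqs(nested_paths):
    """Flatten one level of nesting, drop zero-byte files, and pair each file with an ID."""
    flat = [p for group in nested_paths for p in group]
    non_empty = [p for p in flat if p.stat().st_size > 0]
    return [(simple_name(p), p) for p in non_empty]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical per-cell files, created only for this illustration.
    empty = Path(tmp) / "cell_A.fastq.gz"
    empty.touch()
    filled = Path(tmp) / "cell_B.fastq.gz"
    filled.write_bytes(b"@read1\nACGT\n+\nIIII\n")

    all_empty = per_cell_fastqs([[empty]])
    mixed = per_cell_fastqs([[empty], [filled]])
    print(all_empty)                     # []
    print([name for name, _ in mixed])   # ['cell_B']
```

If every per-cell fastq is zero bytes, the filter leaves nothing, which would be consistent with fastp and the later processes never starting. Checking the file sizes in the `extract_per_cell_fastqs` work directories would confirm or rule this out.
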

snafees commented 4 years ago

Hey @pranathivemuri, just ran from your branch and this is the error I get:

```
saba@lrrr:~/code/tabula-microcebus/workflows/kmermaid/10x/stumpy-bernard-lung$ make
nextflow run \
  nf-core/kmermaid -r pranathi-10x-gz -latest -profile docker -work-dir /mnt/data_lg/saba/nextflow-intermediates/ -resume -with-tower --track_abundance --peptide_fasta /home/olga/data_lg/czbiohub-reference/uniprot/releases/2019_11/manually_downloaded/uniprot-reviewed_yes+taxonomy_2759.fasta.gz --extract_coding_peptide_ksize 9 --extract_coding_jaccard_threshold 0.95 --molecules dna,protein,dayhoff --ksizes 21,24,27,33,36,39,42,45,51 --log2_sketch_sizes 14 -dump-channels \
  --bam /mnt/data_sm/olga/tabula-microcebus/rawdata/tenx/lung-soft-links/lemur-bams-from-michael/'**for_kmermaid.bam' \
  --outdir /mnt/data_sm/olga/tabula-microcebus/analyses/kmermaid/tenx-bam-- stumpy-bernard-lung
N E X T F L O W  ~  version 20.01.0
Pulling nf-core/kmermaid ...
Already-up-to-date
Launching `nf-core/kmermaid` [thirsty_colden] - revision: 4347fa7062 [pranathi-10x-gz]
[DUMP: bam] /mnt/data_sm/olga/tabula-microcebus/rawdata/tenx/lung-soft-links/lemur-bams-from-michael/bernard_lung/10X_P3_12_S13_MW_gene_exon_tagged2__for_kmermaid.bam
[DUMP: bam] /mnt/data_sm/olga/tabula-microcebus/rawdata/tenx/lung-soft-links/lemur-bams-from-michael/ML_Stumpy_lung/10X_P8_0_S1_MW_gene_exon_tagged2_for_kmermaid.bam
----------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/kmermaid v1.0.0dev
----------------------------------------------------
Pipeline Release  : pranathi-10x-gz
Run Name          : thirsty_colden
BAM               : /mnt/data_sm/olga/tabula-microcebus/rawdata/tenx/lung-soft-links/lemur-bams-from-michael/**for_kmermaid.bam
Skip trimming?    : false
K-mer sizes       : 21,24,27,33,36,39,42,45,51
Molecule          : dna,protein,dayhoff
Log2 Sketch Sizes : 14
One Sig per Record: false
Track Abundance   : true
Peptide fasta     : /home/olga/data_lg/czbiohub-reference/uniprot/releases/2019_11/manually_downloaded/uniprot-reviewed_yes+taxonomy_2759.fasta.gz
Peptide ksize     : 7
Peptide molecule  : protein
Bloom filter table size: 1e8
Max Resources     : 128 GB memory, 16 cpus, 10d time per job
Container         : docker - nfcore/kmermaid:dev
Output dir        : /mnt/data_sm/olga/tabula-microcebus/analyses/kmermaid/tenx-bam--
Launch dir        : /home/saba/code/tabula-microcebus/workflows/kmermaid/10x/stumpy-bernard-lung
Working dir       : /mnt/data_lg/saba/nextflow-intermediates
Script dir        : /home/saba/.nextflow/assets/nf-core/kmermaid
User              : saba
Config Profile    : docker
----------------------------------------------------
Monitor the execution with Nextflow Tower using this url https://tower.nf/watch/E8Yo5DQG
executor >  local (3)
[bb/8f02e8] process > get_software_versions                    [100%] 1 of 1 ✔
[37/06dbd4] process > make_protein_index                       [100%] 1 of 1, cached: 1 ✔
[ff/b5089a] process > bam2fasta                                [100%] 2 of 2, failed: 2, retries: 1 ✘
[-        ] process > fastp                                    -
[-        ] process > translate                                -
[-        ] process > sourmash_compute_sketch_fastx_nucleotide -
[-        ] process > sourmash_compute_sketch_fastx_peptide    -
[-        ] process > sourmash_compare_sketches                -
[nf-core/kmermaid] Pipeline completed with errors
WARN: Access to undefined parameter `save_intermediate_files` -- Initialise it to a default value eg. `params.save_intermediate_files = some_value`
[29/8509a7] NOTE: Process `bam2fasta (10X_P3_12_S13_MW_gene_exon_tagged2__for_kmermaid)` terminated with an error exit status (2) -- Execution is retried (1)
WARN: Tower request field `workflow.errorMessage` exceeds expected size | offending value: `WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
usage: bam2fasta [--filename FILENAME] [--min-umi-per-barcode MIN_UMI_PER_BARCODE] [--write-barcode-meta-csv WRITE_BARCODE_META_CSV] [-p PROCESSES] [--delimiter DELIMITER] [--save-fastas SAVE_FASTAS] [--save-intermediate-files SAVE_INTERMEDIATE_FILES] [--line-count LINE_COUNT] [--cell-barcode-pattern CELL_BARCODE_PATTERN] [--molecular-barcode-pattern MOLECULAR_BARCODE_PATTERN] [--rename-10x-barcodes RENAME_10X_BARCODES] [--barcodes-file BARCODES_FILE] [--method METHOD]
bam2fasta: error: unrecognized arguments: --shard-size 350 --output-format fastq.gz --channel-id 10X_P3_12_S13_MW_gene_exon_tagged2__for_kmermaid`, size: 910 (max: 255)
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.
Error executing process > 'bam2fasta (10X_P3_12_S13_MW_gene_exon_tagged2__for_kmermaid)'

Caused by:
  Process `bam2fasta (10X_P3_12_S13_MW_gene_exon_tagged2__for_kmermaid)` terminated with an error exit status (2)

Command executed:

  bam2fasta percell \
    --processes 16 \
    --min-umi-per-barcode 0 \
    --shard-size 350 \
    \
    --min-umi-per-barcode 0 \
    --shard-size 350 \
    \
    \
    --save-fastas fastas \
    --output-format fastq.gz \
    --channel-id 10X_P3_12_S13_MW_gene_exon_tagged2__for_kmermaid \
    --save-intermediate-files null \
    \
    --filename 10X_P3_12_S13_MW_gene_exon_tagged2__for_kmermaid.bam
  find fastas/ -type f -name "*.fastq.gz" | while read src; do
    if [[ $src == *"|"* ]]; then
      mv "$src" $(echo "$src" | tr "|" "_");
    fi
  done

Command exit status:
  2

Command output:
  (empty)

Command error:
  WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
  usage: bam2fasta [--filename FILENAME] [--min-umi-per-barcode MIN_UMI_PER_BARCODE] [--write-barcode-meta-csv WRITE_BARCODE_META_CSV] [-p PROCESSES] [--delimiter DELIMITER] [--save-fastas SAVE_FASTAS] [--save-intermediate-files SAVE_INTERMEDIATE_FILES] [--line-count LINE_COUNT] [--cell-barcode-pattern CELL_BARCODE_PATTERN] [--molecular-barcode-pattern MOLECULAR_BARCODE_PATTERN] [--rename-10x-barcodes RENAME_10X_BARCODES] [--barcodes-file BARCODES_FILE] [--method METHOD]
  bam2fasta: error: unrecognized arguments: --shard-size 350 --output-format fastq.gz --channel-id 10X_P3_12_S13_MW_gene_exon_tagged2__for_kmermaid

Work dir:
  /mnt/data_lg/saba/nextflow-intermediates/ff/b5089a97fd6ae8d891d0d48f803ab9

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Makefile:28: recipe for target 'stumpy-bernard-lung' failed
make: *** [stumpy-bernard-lung] Error 1
```
pranathivemuri commented 4 years ago

@snafees, it looks like bam2fasta is not up to date in this Docker container. Could you pull the Docker container again and retry on lrrr?
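
A quick way to confirm a version mismatch like this is to compare the flags the pipeline passes against the flags the installed tool advertises. The Python sketch below does that with the usage text copied from the error above; `supported_flags`, `missing_flags`, and the `REQUIRED` list are illustrative names, not part of bam2fasta or kmermaid.

```python
import re

# Usage text copied from the bam2fasta error message in the log above.
USAGE = (
    "usage: bam2fasta [--filename FILENAME] [--min-umi-per-barcode MIN_UMI_PER_BARCODE] "
    "[--write-barcode-meta-csv WRITE_BARCODE_META_CSV] [-p PROCESSES] [--delimiter DELIMITER] "
    "[--save-fastas SAVE_FASTAS] [--save-intermediate-files SAVE_INTERMEDIATE_FILES] "
    "[--line-count LINE_COUNT] [--cell-barcode-pattern CELL_BARCODE_PATTERN] "
    "[--molecular-barcode-pattern MOLECULAR_BARCODE_PATTERN] "
    "[--rename-10x-barcodes RENAME_10X_BARCODES] [--barcodes-file BARCODES_FILE] [--method METHOD]"
)

# Long flags from the failing command. `--processes` is left out because the
# usage line only shows its short form, `-p`.
REQUIRED = [
    "--min-umi-per-barcode",
    "--shard-size",
    "--save-fastas",
    "--output-format",
    "--channel-id",
    "--save-intermediate-files",
    "--filename",
]


def supported_flags(usage_text):
    """Collect every long option (--like-this) that appears in a usage string."""
    return set(re.findall(r"(?<![\w-])--[a-z0-9][a-z0-9-]*", usage_text))


def missing_flags(usage_text, required):
    """Return the required flags that the usage string does not mention, in order."""
    have = supported_flags(usage_text)
    return [flag for flag in required if flag not in have]


print(missing_flags(USAGE, REQUIRED))
```

The three flags reported as missing are the same ones named in the "unrecognized arguments" error, which fits the explanation that the container holds an older bam2fasta than the branch expects.
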