jodyphelan / TBProfiler

Profiling tool for Mycobacterium tuberculosis to detect resistance and strain type from WGS data
GNU General Public License v3.0

When I add --calling-params, errors are always reported. #247

Closed denglele0408 closed 2 years ago

denglele0408 commented 2 years ago

Hi @jodyphelan,

We want to pass additional parameters to freebayes when calling SNPs, for example --min-mapping-quality 60. We tried the following commands, but they all failed. How should we pass parameters through --calling_params?

batchtb --input=/mnt/f/2619001/ --output=/mnt/f/2619001/ -tp-opt="--call_whole_genome --txt --threads 4 --calling_params 'min-mapping-quality>60'"
error: Failed to open file "min-mapping-quality>60" : No such file or directory
ERROR(freebayes): Could not open input BAM file: min-mapping-quality>60

batchtb --input=/mnt/f/2619001/ --output=/mnt/f/2619001/ -tp-opt="--call_whole_genome --txt --threads 4 --calling_params -min-mapping-quality>60"
error: tb-profiler profile: error: argument --calling_params: expected one argument

batchtb --input=/mnt/f/2619001/ --output=/mnt/f/2619001/ -tp-opt="--call_whole_genome --txt --threads 4 --calling_params --min-mapping-quality>60"
error: tb-profiler profile: error: argument --calling_params: expected one argument

batchtb --input=/mnt/f/2619001/ --output=/mnt/f/2619001/ -tp-opt="--call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality>60'"
error: freebayes: unrecognized option '--min-mapping-quality>60' did you mean --min-mapping-quality ?

batchtb --input=/mnt/f/2619001/ --output=/mnt/f/2619001/ -tp-opt="--call_whole_genome --txt --threads 4 --calling_params(--min-mapping-quality>60 --min-base-quality>13 --min-coverage>10)"
error: tb-profiler: error: unrecognized arguments: --calling_params(--min-mapping-quality>60 --min-base-quality>13 --min-coverage>10)

denglele0408 commented 2 years ago

"batchtb" is a command we wrote for batch processing

jodyphelan commented 2 years ago

Hi @denglele0408

Could you try this?

batchtb --input=/mnt/f/2619001/ --output=/mnt/f/2619001/ -tp-opt="--call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality 60'"

denglele0408 commented 2 years ago

Hi @jodyphelan

I tried this, but the output still shows an error:

(base) batchtb --input=/mnt/f/2619001/ --output=/mnt/f/2619001/ -tp-opt="--call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality 60'"
2022/10/10 08:04:38 rm -rf ./bam (in dir /home/mtb/tmp)
2022/10/10 08:04:38 rm -rf ./vcf (in dir /home/mtb/tmp)
2022/10/10 08:04:38 rm -rf ./results (in dir /home/mtb/tmp)
2022/10/10 08:04:38 rm -rf ./extra (in dir /home/mtb/tmp)
2022/10/10 08:04:38 preparing reference genome index
2022/10/10 08:04:38 reference genome index already exists, using it directly
2022/10/10 08:04:38 processing sample 2619001 (1/2)...
2022/10/10 08:04:38 cp -f /mnt/f/2619001/2619001_1.fq.gz /home/mtb/tmp/2619001_1.fq.gz
2022/10/10 08:04:48 cp -f /mnt/f/2619001/2619001_2.fq.gz /home/mtb/tmp/2619001_2.fq.gz
2022/10/10 08:04:59 tb-profiler profile --external_db /home/mtb/tbdb/tbdb -1 /home/mtb/tmp/2619001_1.fq.gz -2 /home/mtb/tmp/2619001_2.fq.gz --prefix 2619001 --call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality 60' (in dir /home/mtb/tmp)
usage: tb-profiler [-h] [--version] {profile,vcf_profile,fasta_profile,lineage,spoligotype,collate,reprofile,reformat,create_db,load_library,update_tbdb,version} ...
tb-profiler: error: unrecognized arguments: 60'
2022/10/10 08:04:59 error: tb-profiler: exit status 2: tb-profiler profile --external_db /home/mtb/tbdb/tbdb -1 /home/mtb/tmp/2619001_1.fq.gz -2 /home/mtb/tmp/2619001_2.fq.gz --prefix 2619001 --call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality 60'
2022/10/10 08:04:59 21.657367404s
2022/10/10 08:04:59 write: /mnt/f/2619001/results/2619001.error.log
2022/10/10 08:05:00 processing sample 2619002 (2/2)...
2022/10/10 08:05:00 cp -f /mnt/f/2619001/2619002_1.fq.gz /home/mtb/tmp/2619002_1.fq.gz
2022/10/10 08:05:10 cp -f /mnt/f/2619001/2619002_2.fq.gz /home/mtb/tmp/2619002_2.fq.gz
2022/10/10 08:05:21 tb-profiler profile --external_db /home/mtb/tbdb/tbdb -1 /home/mtb/tmp/2619002_1.fq.gz -2 /home/mtb/tmp/2619002_2.fq.gz --prefix 2619002 --call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality 60' (in dir /home/mtb/tmp)
usage: tb-profiler [-h] [--version] {profile,vcf_profile,fasta_profile,lineage,spoligotype,collate,reprofile,reformat,create_db,load_library,update_tbdb,version} ...
tb-profiler: error: unrecognized arguments: 60'
2022/10/10 08:05:21 error: tb-profiler: exit status 2: tb-profiler profile --external_db /home/mtb/tbdb/tbdb -1 /home/mtb/tmp/2619002_1.fq.gz -2 /home/mtb/tmp/2619002_2.fq.gz --prefix 2619002 --call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality 60'
2022/10/10 08:05:21 21.66531546s
2022/10/10 08:05:21 write: /mnt/f/2619001/results/2619002.error.log
2022/10/10 08:05:21 tb-profiler collate --external_db /home/mtb/tbdb/tbdb (in dir /mnt/f/2619001/)
Using gff file: /home/mtb/tbdb/tbdb.gff
Using ref file: /home/mtb/tbdb/tbdb.fasta
Using barcode file: /home/mtb/tbdb/tbdb.barcode.bed
Using bed file: /home/mtb/tbdb/tbdb.bed
Using json_db file: /home/mtb/tbdb/tbdb.dr.json
Using version file: /home/mtb/tbdb/tbdb.version.json
Using spacers file: /home/mtb/tbdb/tbdb.spoligotype_spacers.txt
Using variables file: /home/mtb/tbdb/tbdb.variables.json
0it [00:00, ?it/s]
2022/10/10 08:05:21 [1/2] skip 2619001: open /mnt/f/2619001/results/2619001.results.json: no such file or directory
2022/10/10 08:05:21 [2/2] skip 2619002: open /mnt/f/2619001/results/2619002.results.json: no such file or directory
2022/10/10 08:05:21 generating /mnt/f/2619001/lineage.csv ...
2022/10/10 08:05:21 error: result vcf not found: /mnt/f/2619001/vcf/2619001.targets.csq.vcf.gz
2022/10/10 08:05:21 error: result vcf not found: /mnt/f/2619001/vcf/2619002.targets.csq.vcf.gz
2022/10/10 08:05:21 bcftools merge -Oz -o all.merged.vcf.gz (in dir /mnt/f/2619001/vcf)

About: Merge multiple VCF/BCF files from non-overlapping sample sets to create one multi-sample file. Note that only records from different files can be merged, never from the same file. For "vertical" merge take a look at "bcftools norm" instead. Usage: bcftools merge [options] <A.vcf.gz> <B.vcf.gz> [...]

Options: --force-samples Resolve duplicate sample names --print-header Print only the merged header and exit --use-header FILE Use the provided header -0 --missing-to-ref Assume genotypes at missing sites are 0/0 -f, --apply-filters LIST Require at least one of the listed FILTER strings (e.g. "PASS,.") -F, --filter-logic x|+ Remove filters if some input is PASS ("x"), or apply all filters ("+") [+] -g, --gvcf -|REF.FA Merge gVCF blocks, INFO/END tag is expected. Implies -i QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max -i, --info-rules TAG:METHOD,.. Rules for merging INFO fields (method is one of sum,avg,min,max,join) or "-" to turn off the default [DP:sum,DP4:sum] -l, --file-list FILE Read file names from the file -L, --local-alleles INT EXPERIMENTAL: if more than ALT alleles are encountered, drop FMT/PL and output LAA+LPL instead; 0=unlimited [0] -m, --merge STRING Allow multiallelic records for <snps|indels|both|all|none|id>, see man page for details [both] --no-index Merge unindexed files, the same chromosomal order is required and -r/-R are not allowed --no-version Do not append version and command line to the header -o, --output FILE Write output to a file [standard output] -O, --output-type u|b|v|z[0-9] u/b: un/compressed BCF, v/z: un/compressed VCF, 0-9: compression level [v] -r, --regions REGION Restrict to comma-separated list of regions -R, --regions-file FILE Restrict to regions listed in a file --regions-overlap 0|1|2 Include if POS in the region (0), record overlaps (1), variant overlaps (2) [1] --threads INT Use multithreading with worker threads [0]

2022/10/10 08:05:21 error exit status 1: bcftools merge -Oz -o all.merged.vcf.gz
2022/10/10 08:05:21 bcftools merge -Oz -o all.merged.targets.csq.vcf.gz (in dir /mnt/f/2619001/vcf)

2022/10/10 08:05:22 error exit status 1: bcftools merge -Oz -o all.merged.targets.csq.vcf.gz
2022/10/10 08:05:22 java -Xmx4g -jar /home/mtb/snpEff/snpEff.jar -i vcf Mycobacterium_tuberculosis_h37rv all.merged.vcf.gz (in dir /mnt/f/2619001/vcf)
Error : Cannot read input file 'all.merged.vcf.gz'
Command line : SnpEff -i vcf Mycobacterium_tuberculosis_h37rv all.merged.vcf.gz

snpEff version SnpEff 5.1 (build 2022-01-21 06:23), by Pablo Cingolani Usage: snpEff [eff] [options] genome_version [input_file]

    variants_file                   : Default is STDIN

Options: -chr : Prepend 'string' to chromosome name (e.g. 'chr1' instead of '1'). Only on TXT output. -classic : Use old style annotations instead of Sequence Ontology and Hgvs. -csvStats : Create CSV summary file. -download : Download reference genome if not available. Default: true -i : Input format [ vcf, bed ]. Default: VCF. -fileList : Input actually contains a list of files to process. -o : Ouput format [ vcf, gatk, bed, bedAnn ]. Default: VCF. -s , -stats, -htmlStats : Create HTML summary file. Default is 'snpEff_summary.html' -noStats : Do not create stats (summary) file

Results filter options: -fi , -filterInterval : Only analyze changes that intersect with the intervals specified in this file (you may use this option many times) -no-downstream : Do not show DOWNSTREAM changes -no-intergenic : Do not show INTERGENIC changes -no-intron : Do not show INTRON changes -no-upstream : Do not show UPSTREAM changes -no-utr : Do not show 5_PRIME_UTR or 3_PRIME_UTR changes -no : Do not show 'EffectType'. This option can be used several times.

Annotations options: -cancer : Perform 'cancer' comparisons (Somatic vs Germline). Default: false -cancerSamples : Two column TXT file defining 'oringinal \t derived' samples. -fastaProt : Create an output file containing the resulting protein sequences. -formatEff : Use 'EFF' field compatible with older versions (instead of 'ANN'). -geneId : Use gene ID instead of gene name (VCF output). Default: false -hgvs : Use HGVS annotations for amino acid sub-field. Default: true -hgvsOld : Use old HGVS notation. Default: false -hgvs1LetterAa : Use one letter Amino acid codes in HGVS notation. Default: false -hgvsTrId : Use transcript ID in HGVS notation. Default: false -lof : Add loss of function (LOF) and Nonsense mediated decay (NMD) tags. -noHgvs : Do not add HGVS annotations. -noLof : Do not add LOF and NMD annotations. -noShiftHgvs : Do not shift variants according to HGVS notation (most 3prime end). -oicr : Add OICR tag in VCF file. Default: false -sequenceOntology : Use Sequence Ontology terms. Default: true

Generic options: -c , -config : Specify config file -configOption name=value : Override a config file option -d , -debug : Debug mode (very verbose). -dataDir : Override data_dir parameter from config file. -download : Download a SnpEff database, if not available locally. Default: true -nodownload : Do not download a SnpEff database, if not available locally. -h , -help : Show this help and exit -noLog : Do not report usage statistics to server -q , -quiet : Quiet mode (do not show any messages or errors) -v , -verbose : Verbose mode -version : Show version number and exit

Database options: -canon : Only use canonical transcripts. -canonList : Only use canonical transcripts, replace some transcripts using the 'gene_id transcript_id' entries in . -interaction : Annotate using inteactions (requires interaciton database). Default: true -interval : Use a custom intervals in TXT/BED/BigBed/VCF/GFF file (you may use this option many times) -maxTSL : Only use transcripts having Transcript Support Level lower than . -motif : Annotate using motifs (requires Motif database). Default: true -nextProt : Annotate using NextProt (requires NextProt database). -noGenome : Do not load any genomic database (e.g. annotate using custom files). -noExpandIUB : Disable IUB code expansion in input variants -noInteraction : Disable inteaction annotations -noMotif : Disable motif annotations. -noNextProt : Disable NextProt annotations. -onlyReg : Only use regulation tracks. -onlyProtein : Only use protein coding transcripts. Default: false -onlyTr : Only use the transcripts in this file. Format: One transcript ID per line. -reg : Regulation track to use (this option can be used add several times). -ss , -spliceSiteSize : Set size for splice sites (donor and acceptor) in bases. Default: 2 -spliceRegionExonSize : Set size for splice site region within exons. Default: 3 bases -spliceRegionIntronMin : Set minimum number of bases for splice site region within intron. Default: 3 bases -spliceRegionIntronMax : Set maximum number of bases for splice site region within intron. Default: 8 bases -strict : Only use 'validated' transcripts (i.e. sequence has been checked). 
Default: false -ud , -upDownStreamLen : Set upstream downstream interval length (in bases)
2022/10/10 08:05:22 error exit status 255: java -Xmx4g -jar /home/mtb/snpEff/snpEff.jar -i vcf Mycobacterium_tuberculosis_h37rv all.merged.vcf.gz
2022/10/10 08:05:22 java -Xmx4g -jar /home/mtb/snpEff/snpEff.jar -i vcf Mycobacterium_tuberculosis_h37rv all.merged.targets.csq.vcf.gz (in dir /mnt/f/2619001/vcf)
Error : Cannot read input file 'all.merged.targets.csq.vcf.gz'
Command line : SnpEff -i vcf Mycobacterium_tuberculosis_h37rv all.merged.targets.csq.vcf.gz

2022/10/10 08:05:22 error exit status 255: java -Xmx4g -jar /home/mtb/snpEff/snpEff.jar -i vcf Mycobacterium_tuberculosis_h37rv all.merged.targets.csq.vcf.gz

denglele0408 commented 2 years ago

The main errors are:

tb-profiler: error: unrecognized arguments: 60'
2022/10/10 08:04:59 error: tb-profiler: exit status 2: tb-profiler profile --external_db /home/mtb/tbdb/tbdb -1 /home/mtb/tmp/2619001_1.fq.gz -2 /home/mtb/tmp/2619001_2.fq.gz --prefix 2619001 --call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality 60'
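
One plausible reading of the `unrecognized arguments: 60'` message is that the quoted value was split at the space before it reached tb-profiler. The sketch below reproduces that symptom with Python's `argparse`. It is only an illustration: the parser is a stand-in that defines just the options used in this thread (not tb-profiler's real parser), and splitting the `-tp-opt` string on whitespace is an assumption about what a wrapper might do, since the batchtb source is not shown here.

```python
import argparse

# Stand-in parser with only the options used in this thread; the real
# tb-profiler parser defines many more (this is an illustrative assumption).
parser = argparse.ArgumentParser(prog="tb-profiler")
parser.add_argument("--call_whole_genome", action="store_true")
parser.add_argument("--txt", action="store_true")
parser.add_argument("--threads", type=int, default=1)
parser.add_argument("--calling_params", type=str)

tp_opt = "--call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality 60'"

# ASSUMPTION: the wrapper splits the option string on whitespace only,
# so the single quotes are kept as literal characters.
naive_tokens = tp_opt.split()

# parse_known_args returns the tokens it could not place instead of exiting.
args, leftover = parser.parse_known_args(naive_tokens)
print(args.calling_params)  # the option name with a stray leading quote
print(leftover)             # the orphaned token: 60'
```

With this splitting, `--calling_params` receives only `'--min-mapping-quality` and the trailing `60'` is left over, which matches the token named in the error message.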

jodyphelan commented 2 years ago

Thanks for the quick feedback, what OS are you running?

Switching the inputs from that command seems to work ok for me:

tb-profiler profile -a ERR1633956.bqsr.cram --call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality 60'

Could it be an OS-specific issue?
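
The value given to `--calling_params` in the command above is a single quoted string with the freebayes option and its value separated by a space, not joined by `>`. A minimal sketch of building such a string for several thresholds follows. The helper name `freebayes_params` is hypothetical, the numbers are the ones from the attempts earlier in this thread, and it is an assumption (not confirmed in this thread) that tb-profiler hands a string with more than one option to freebayes unchanged.

```python
import shlex

def freebayes_params(options):
    """Join {option name: value} pairs as space-separated '--name value' words.

    Hypothetical helper for illustration; not part of tb-profiler or freebayes.
    """
    return " ".join(f"--{name} {value}" for name, value in options.items())

# Thresholds taken from the attempts earlier in this thread.
params = freebayes_params({
    "min-mapping-quality": 60,
    "min-base-quality": 13,
    "min-coverage": 10,
})

# Each list element is exactly one command-line argument.
argv = [
    "tb-profiler", "profile", "-a", "ERR1633956.bqsr.cram",
    "--call_whole_genome", "--txt", "--threads", "4",
    "--calling_params", params,
]

# shlex.join (Python 3.8+) adds the quoting needed to paste the command
# into a POSIX shell.
print(shlex.join(argv))
```

Printing the command with `shlex.join` shows the whole parameter string wrapped in one pair of quotes, so a POSIX shell passes it to tb-profiler as one argument.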

denglele0408 commented 2 years ago

Hi @jodyphelan

Thank you very much for your detailed answer, which was very helpful. I think I know the cause: the error is probably introduced by our batch command. It occurs when I add "--calling_params '--min-mapping-quality 60'" to the batch command, but not when I run the batch command with the default parameters. Running tb-profiler directly with the following command succeeded:

tb-profiler profile --external_db /home/mtb/tbdb/tbdb -1 /mnt/f/2619001/2619001_1.fq.gz -2 /mnt/f/2619001/2619001_2.fq.gz --prefix 0521006 --call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality 60'
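
If the batch command is indeed what breaks the quoted value, one possible remedy is for the wrapper to tokenize the `-tp-opt` string with shell quoting rules and pass the result to the child process as a list. The following is a sketch under that assumption only. `build_tb_profiler_argv` is a hypothetical helper, batchtb's implementation is not shown in this thread, and a wrapper written in another language would need that language's equivalent of a shell-aware tokenizer.

```python
import shlex

def build_tb_profiler_argv(tp_opt, read1, read2, prefix):
    """Build the argument list for one sample (hypothetical wrapper helper)."""
    base = ["tb-profiler", "profile", "-1", read1, "-2", read2, "--prefix", prefix]
    # shlex.split applies POSIX shell quoting rules: the single-quoted value
    # stays one argument and the quote characters themselves are removed.
    return base + shlex.split(tp_opt)

tp_opt = "--call_whole_genome --txt --threads 4 --calling_params '--min-mapping-quality 60'"
argv = build_tb_profiler_argv(tp_opt, "2619001_1.fq.gz", "2619001_2.fq.gz", "2619001")

print(argv[-2:])
# The list could then be run without a shell, for example with
# subprocess.run(argv, check=True), so no further splitting takes place.
```

Passing a list rather than a re-joined string means the space inside the freebayes parameters never has to survive a second round of parsing.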

jodyphelan commented 2 years ago

No problem! Let me know if there are any more issues.

denglele0408 commented 2 years ago

Dear @jodyphelan

Thank you very much for your guidance and for sharing this tool. Your help has greatly benefited my doctoral research. I will acknowledge your help and tb-profiler in the acknowledgments section of my doctoral thesis, and I look forward to the next software you share!

Best regards!

jodyphelan commented 2 years ago

Thanks!