nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License

Error when using a custom reference #1649

Open ybdong919 opened 2 months ago

ybdong919 commented 2 months ago

Description of the bug

When I use my custom reference, this error always appears: This path is not available within annotation-cache. Please check https://annotation-cache.github.io/ to create a request for it.

My command is: nextflow run ./sarek -profile singularity --input samplesheet.csv --outdir ./ --tools 'freebayes,snpeff' --genome null --igenomes_ignore --fasta ./ref/hs37d5.fa.gz --skip_tools baserecalibrator

Command used and terminal output

$ nextflow run ./sarek -profile singularity --input samplesheet.csv --outdir ./ --tools 'freebayes,snpeff' --genome null --igenomes_ignore --fasta ./ref/hs37d5.fa.gz --skip_tools baserecalibrator

terminal output:

N E X T F L O W   ~  version 24.04.4

Launching `./sarek/main.nf` [distraught_edison] DSL2 - revision: e3d6110e17

WARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`

------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
      ____
    .´ _  `.
   /  |\`-_ \      __        __   ___     
  |   | \  `-|    |__`  /\  |__) |__  |__/
   \ |   \  /     .__| /¯¯\ |  \ |___ |  \
    `|____\´

  nf-core/sarek v3.4.4
....
* Software dependencies
  https://github.com/nf-core/sarek/blob/master/CITATIONS.md
------------------------------------------------------
[-        ] NFC…EPARE_GENOME:BWAMEM1_INDEX -
[-        ] NFC…EPARE_GENOME:BWAMEM2_INDEX -
[-        ] NFC…E_GENOME:DRAGMAP_HASHTABLE -
[-        ] NFC…4_CREATESEQUENCEDICTIONARY -
[-        ] NFC…E_GENOME:MSISENSORPRO_SCAN -
[-        ] NFC…PARE_GENOME:SAMTOOLS_FAIDX -
[-        ] NFC…TABIX_BCFTOOLS_ANNOTATIONS -
[-        ] NFC…PREPARE_GENOME:TABIX_DBSNP -
[-        ] NFC…ME:TABIX_GERMLINE_RESOURCE -
[-        ] NFC…RE_GENOME:TABIX_KNOWN_SNPS -
[-        ] NFC…_GENOME:TABIX_KNOWN_INDELS -
[-        ] NFC…K:PREPARE_GENOME:TABIX_PON -
[-        ] NFC…_INTERVALS:BUILD_INTERVALS -
[-        ] NFC…RVALS:CREATE_INTERVALS_BED -
[-        ] NFC…_BGZIPTABIX_INTERVAL_SPLIT -
[-        ] NFC…ZIPTABIX_INTERVAL_COMBINED -
This path is not available within annotation-cache.
Please check https://annotation-cache.github.io/ to create a request for it.

Relevant files

No response

System information

No response

asp8200 commented 2 months ago

I was able to reproduce the error.

The error occurs because you pass --genome null --igenomes_ignore. As far as I can tell, --snpeff_genome and --snpeff_db then no longer get set through the igenomes.config-file. (If that is indeed the case, then I think Sarek should issue a more informative error message.)

Could you try adding --snpeff_genome GRCh38 --snpeff_db 105, or whichever snpEff genome and database version you want to use, to your Nextflow command?

If you can't find any info on this in the docs for Sarek, then we might have to add some info there.
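
As an illustration of the suggestion above, the original command with the two snpEff options added could be assembled as in the sketch below. The variable names SNPEFF_GENOME and SNPEFF_DB are placeholders introduced here, and the values GRCh38 and 105 are simply the ones mentioned in the comment; they would need to be replaced by a genome and database version that match the reference actually in use.

```shell
#!/bin/sh
# Sketch: the command from the bug report plus explicit snpEff settings.
# SNPEFF_GENOME and SNPEFF_DB are placeholder variables (an assumption);
# their values must match the custom reference passed via --fasta.
set -eu

SNPEFF_GENOME="GRCh38"
SNPEFF_DB="105"

# Store the full command line as positional parameters.
set -- nextflow run ./sarek \
  -profile singularity \
  --input samplesheet.csv \
  --outdir ./ \
  --tools 'freebayes,snpeff' \
  --genome null \
  --igenomes_ignore \
  --fasta ./ref/hs37d5.fa.gz \
  --skip_tools baserecalibrator \
  --snpeff_genome "$SNPEFF_GENOME" \
  --snpeff_db "$SNPEFF_DB"

# Print the assembled command for review instead of launching it.
printf '%s ' "$@"
echo
```

Once the printed command looks right, replacing the final printf and echo lines with "$@" would launch it.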

ybdong919 commented 2 months ago

How can I list all available snpEff databases or genomes?

maxulysse commented 2 months ago

I'd check the https://pcingola.github.io/SnpEff/ and https://www.ensembl.org/info/docs/tools/vep/index.html websites for it; they have tons of genomes and lots of different versions. We also mirror some of them at https://annotation-cache.github.io/

ybdong919 commented 2 months ago

Why is only chr21 analyzed by freebayes? When I checked the vcf generated by freebayes, I found that only chr21 was analyzed, and the vcf contains the line "##commandline="freebayes -f genome.fa --target chr21_1-46709983.bed --min-alternate-fraction 0.1 --min-mapping-quality 1 S1.md.cram"". Does freebayes only analyze chr21 by default in Sarek? How can I make it analyze all chromosomes?

asp8200 commented 2 months ago

Why is only chr21 analyzed by freebayes? When I checked the vcf generated by freebayes, I found that only chr21 was analyzed, and the vcf contains the line "##commandline="freebayes -f genome.fa --target chr21_1-46709983.bed --min-alternate-fraction 0.1 --min-mapping-quality 1 S1.md.cram"". Does freebayes only analyze chr21 by default in Sarek? How can I make it analyze all chromosomes?

I had a look at the freebayes-vcf here

s3://nf-core-awsmegatests/sarek/results-5cc30494a6b8e7e53be64d308b582190ca7d2585/test_full_germline_aws/variant_calling/freebayes/NA12878/NA12878.freebayes.vcf.gz

which is from test_full_germline executed on Sarek v3.4.4 over awsbatch.

The freebayes-vcf contains one ##commandline tagged line, and it is the following:

##commandline="freebayes -f Homo_sapiens_assembly38.fasta --target chr6_95070791-167591393.bed --min-alternate-fraction 0.1 --min-mapping-quality 1 NA12878.recal.cram" 

The pipeline runs freebayes on a set of intervals, and the resulting vcf-files then get merged by the following command:

gatk --java-options "-Xmx3276M -XX:-UsePerfData" \
    MergeVcfs \
    --INPUT NA12878.chrY_9055175-9057608.gz.sort.vcf.gz --INPUT NA12878.chr12_37235253-37240944.gz.sort.vcf.gz --INPUT NA12878.chr6_95070791-167591393.gz.sort.vcf.gz --INPUT NA12878.chr13_86252980-111703855.gz.sort.vcf.gz --INPUT NA12878.chrX_37285838-49348394.gz.sort.vcf.gz --INPUT NA12878.chr18_47019913-54536574.gz.sort.vcf.gz --INPUT NA12878.chr9_41229379-41237752.gz.sort.vcf.gz --INPUT NA12878.chr2_238904048-242183529.gz.sort.vcf.gz --INPUT NA12878.chr10_39590436-39593013.gz.sort.vcf.gz --INPUT NA12878.chr11_51078349-54425074.gz.sort.vcf.gz --INPUT NA12878.chr1_10001-207666.gz.sort.vcf.gz --INPUT NA12878.chr2_16146120-32867130.gz.sort.vcf.gz --INPUT NA12878.chr4_10001-1429358.gz.sort.vcf.gz --INPUT NA12878.chr17_60001-448188.gz.sort.vcf.gz --INPUT NA12878.chr5_139453660-155760324.gz.sort.vcf.gz --INPUT NA12878.chr20_36314720-64334167.gz.sort.vcf.gz --INPUT NA12878.chr8_44033745-45877265.gz.sort.vcf.gz --INPUT NA12878.chr1_122026460-124977944.gz.sort.vcf.gz --INPUT NA12878.chr4_190173122-190204555.gz.sort.vcf.gz --INPUT NA12878.chr15_20729747-21193490.gz.sort.vcf.gz --INPUT NA12878.chr7_58169654-60828234.gz.sort.vcf.gz \
    --OUTPUT NA12878.freebayes.vcf.gz \
    --SEQUENCE_DICTIONARY Homo_sapiens_assembly38.dict \
    --TMP_DIR . \

The merged vcf-file NA12878.freebayes.vcf.gz contains only one ##commandline tagged line, the one mentioned above, yet it contains variants from all the chromosomes. So I guess MergeVcfs just keeps the ##commandline from one of the input vcf-files.

Does your published vcf-file from freebayes only contain variants from within the region chr21:1-46709983?
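
One way to answer that question is to count the variant records per contig in the published file rather than relying on the ##commandline header. The sketch below does this with gzip and awk; count_by_contig is a hypothetical helper name, and the small VCF it builds is made-up demonstration data, not pipeline output.

```shell
#!/bin/sh
# Sketch: count variant records per contig in a VCF to see which
# chromosomes it actually covers. count_by_contig is a hypothetical helper.
set -eu

count_by_contig() {
  # bgzip output is gzip-compatible, so gzip -dc can read .vcf.gz files.
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac |
    awk -F '\t' '!/^#/ { n[$1]++ } END { for (c in n) print c "\t" n[c] }' |
    sort
}

# Tiny made-up VCF, for demonstration only.
demo=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo"
printf 'chr21\t100\t.\tA\tG\t50\t.\t.\n' >> "$demo"
printf 'chr21\t200\t.\tC\tT\t50\t.\t.\n' >> "$demo"
printf 'chr6\t300\t.\tG\tA\t50\t.\t.\n' >> "$demo"

# Prints one line per contig: name, tab, number of records.
result=$(count_by_contig "$demo")
printf '%s\n' "$result"
rm -f "$demo"
```

On a real run the helper would be pointed at the published freebayes file, for example count_by_contig S1.freebayes.vcf.gz, where that file name is an assumption based on the naming of the NA12878 example above.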

ybdong919 commented 2 months ago

I'd check the https://pcingola.github.io/SnpEff/ and https://www.ensembl.org/info/docs/tools/vep/index.html websites for it; they have tons of genomes and lots of different versions. We also mirror some of them at https://annotation-cache.github.io/

Could you give me more detailed information about where to find a list of genomes?

ybdong919 commented 2 months ago

Does your published vcf-file from freebayes only contain variants from within the region chr21:1-46709983?

Yes, only chr21:1-46709983

asp8200 commented 2 months ago

Yes, only chr21:1-46709983

Could you paste the contents of .command.sh for the MergeVcfs-job for FREEBAYES here?
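
Nextflow writes a .command.sh into each task's directory under the work directory, so the script in question can be located by searching for it. The sketch below is one possible way to do that; find_merge_scripts is a hypothetical helper, the match on the strings MergeVcfs and freebayes is an assumption based on the command shown earlier in this thread, and the mock work directory exists only to make the example self-contained.

```shell
#!/bin/sh
# Sketch: list .command.sh files that run MergeVcfs on freebayes output.
# find_merge_scripts is a hypothetical helper; $1 is the Nextflow work dir.
set -eu

find_merge_scripts() {
  find "$1" -type f -name .command.sh | while IFS= read -r script; do
    # Keep only scripts that mention both MergeVcfs and freebayes.
    if grep -q 'MergeVcfs' "$script" && grep -q 'freebayes' "$script"; then
      printf '%s\n' "$script"
    fi
  done
}

# Mock work directory with made-up task scripts, for demonstration only.
work=$(mktemp -d)
mkdir -p "$work/ab/123456" "$work/cd/789abc"
printf 'gatk MergeVcfs --OUTPUT S1.freebayes.vcf.gz\n' > "$work/ab/123456/.command.sh"
printf 'freebayes -f genome.fa S1.md.cram\n' > "$work/cd/789abc/.command.sh"

# Prints the path of the one script that matches both strings.
found=$(find_merge_scripts "$work")
printf '%s\n' "$found"
rm -rf "$work"
```

In an actual run directory the call would be find_merge_scripts work, assuming the default work directory name, followed by cat on the reported path.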