nf-core / sarek

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing
https://nf-co.re/sarek
MIT License

WES files provided by ascat author #1526

Open wlyucl opened 5 months ago

wlyucl commented 5 months ago

Description of the bug

Hi Developers,

I'm trying to run the ASCAT implementation in Sarek for CNV analysis on WES data. The nf-core Sarek website suggests following 5 steps, as specified in this doc https://nf-co.re/sarek/3.4.0/docs/usage#how-to-generate-ascat-resources-for-exome-or-targeted-sequencing, to generate reference files (allele.zip, loci.zip, GC.zip, and RT.zip) for exome data instead of using the default iGenomes files directly. I noticed that the ASCAT authors also provide reference files for WES at https://github.com/VanLoo-lab/ascat/tree/master/ReferenceFiles/WES, which seem to be ready to use when combined with an appropriate BED file. Would it be feasible to replace the default iGenomes references with those files for Sarek ASCAT?

I'm now running Sarek with the parameters --ascat_alleles, --ascat_loci, --ascat_loci_gc, and --ascat_loci_rt on the command line. The pipeline seems to work well, but it would be great to hear your advice.

Thank you!

Command used and terminal output

No response

Relevant files

No response

System information

No response

FriederikeHanssen commented 3 months ago

Hi! By default we supply the WGS files, but you should be able to fetch the files you want and supply them easily via the command line. Thank you for flagging the updated files that are available. We can reflect this in our docs as well and link to them.
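A minimal sketch of what supplying the files on the command line could look like. The four --ascat_* parameters are the ones named earlier in this thread; the directory, file names, sample sheet, target BED, release tag, and profile are hypothetical placeholders, and the script only prints the command so it can be reviewed before launching.

```shell
#!/usr/bin/env bash
# Sketch: point Sarek's ASCAT step at WES-specific resources instead of the
# default WGS files. Every path below is a hypothetical placeholder.
set -euo pipefail

ASCAT_DIR="${ASCAT_DIR:-/path/to/ascat_wes_resources}"   # hypothetical location

args=(
  run nf-core/sarek
  -r 3.4.0                      # assumption: the release the linked docs refer to
  -profile docker               # assumption: any supported profile works here
  --input samplesheet.csv       # hypothetical sample sheet
  --outdir results              # hypothetical output directory
  --wes
  --intervals targets.bed       # hypothetical capture-kit BED file
  --tools ascat
  --ascat_alleles "${ASCAT_DIR}/alleles.zip"
  --ascat_loci    "${ASCAT_DIR}/loci.zip"
  --ascat_loci_gc "${ASCAT_DIR}/GC.zip"
  --ascat_loci_rt "${ASCAT_DIR}/RT.zip"
)

# Print the command for review; remove the "echo" to actually start the run.
echo nextflow "${args[@]}"
```

The resource files need to match both the genome build and the chromosome naming of the BAM/CRAM files and the target BED.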

lauren-tjoeka commented 3 months ago

Looking forward to the update! I'm following this documentation, and in step 3 I think I've encountered a typo in the 'awk' command right after 'do':

cd battenberg_loci_on_target_hg38/
rm *chrstring*
rm 1kg.phase3.v5a_GRCh38nounref_loci_chr23.txt
for i in {1..22} X
do
    awk '{ print $1 "\t" $2-1 "\t" $2 }' 1kg.phase3.v5a_GRCh38nounref_loci_chr${i}.txt > chr${i}.bed
    #awk '{ print "chr" $1 "\t" $2-1 "\t" $2 }' 1kg.phase3.v5a_GRCh38nounref_loci_chr${i}.txt > chr${i}.bed
    grep "^${i}_" GC_G1000_on_target_hg38.txt | awk '{ print "chr" $1 }' > chr${i}.txt
    bedtools intersect -a chr${i}.bed -b targets_with_chr.bed | awk '{ print $1 "_" $3 }' > chr${i}_on_target.txt
    n=$(wc -l chr${i}_on_target.txt | awk '{ print $1 }')
    count=$((n * 3 / 10))
    grep -xf chr${i}.txt chr${i}_on_target.txt > chr${i}.temp
    shuf -n $count chr${i}_on_target.txt >> chr${i}.temp
    sort -n -k2 -t '_' chr${i}.temp | uniq | awk 'BEGIN { FS="_" } ; { print $1 "\t" $2 }' > battenberg_loci_on_target_hg38_chr${i}.txt
done

zip battenberg_loci_on_target_hg38.zip battenberg_loci_on_target_hg38_chr*.txt

I could only get the for loop to run when I instead used the commented-out line that contains "chr". Is this expected behaviour? I'm using hg19 references.
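For context, bedtools intersect compares chromosome names as literal strings, so loci written as "1" never overlap targets written as "chr1", and the on-target file comes out empty. A minimal sketch for checking both files before choosing which awk line to use; the helper name has_chr_prefix and the two small demo files are hypothetical, and it assumes neither file starts with a header line.

```shell
#!/usr/bin/env bash
# Sketch: detect whether loci and target files agree on the "chr" prefix,
# then build the BED file with matching chromosome names.
set -euo pipefail

# Hypothetical helper: succeeds if column 1 of the first line starts with "chr".
# Assumes the file has no header line; an empty file counts as "no prefix".
has_chr_prefix() {
  awk 'NR == 1 { found = ($1 ~ /^chr/); exit } END { exit !found }' "$1"
}

# Tiny demo inputs standing in for the real loci file and target BED.
tmp=$(mktemp -d)
printf '1\t10583\n1\t10611\n' > "$tmp/loci.txt"               # loci without "chr"
printf 'chr1\t10000\t11000\n' > "$tmp/targets_with_chr.bed"   # targets with "chr"

if has_chr_prefix "$tmp/loci.txt"; then loci_style=chr; else loci_style=plain; fi
if has_chr_prefix "$tmp/targets_with_chr.bed"; then bed_style=chr; else bed_style=plain; fi

if [ "$loci_style" = "$bed_style" ]; then
  prefix=""       # names already agree, so column 1 is printed unchanged
elif [ "$bed_style" = chr ]; then
  prefix="chr"    # targets use "chr", loci do not: add it to the loci
else
  echo "targets lack 'chr' but loci have it; strip it from the loci first" >&2
  exit 1
fi

# Same conversion as in the loop above, with the prefix chosen automatically.
awk -v p="$prefix" '{ print p $1 "\t" $2-1 "\t" $2 }' "$tmp/loci.txt" > "$tmp/chr1.bed"
cat "$tmp/chr1.bed"
```

In this demo the targets carry the prefix and the loci do not, so the sketch picks the variant that prepends "chr", which matches the behaviour described above.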

Many thanks!