Closed davidvanzessen closed 1 year ago
We have run smoothly on TSO500 samples from FASTQ, after aligning to our own pipeline version of the genome available from our resource page.
if you are starting from BAM, you should definitely use the same reference genome that you aligned with. I believe that should work, but I can't guarantee all configurations. If it does not work, I recommend just starting from FASTQ. Mixing ref genomes is a bad idea and can lead to lots of different problems.
Thank you for the quick response, we will replace the reference genome with ours and try the pipeline again.
Hello,
Thought it might be useful to let people know that the proposed solution worked, and share what we did to get it work.
Here is the Conda environment that this was tested in:
name: hmf_tso500
channels:
- conda-forge
- bioconda
dependencies:
- openjdk=11.0.13
- samtools
- bwa=0.7.17
- r-base=4.2.2
- r-dplyr=1.0.10
- bioconductor-copynumber=1.38.0
- gsutil
- picard=2.27.4
- gatk4=4.3.0.0
- gridss=2.13.2
- circos
Not all of this is probably needed, but whatever, install with:
conda/mamba env create -f environment.yml
conda activate hmf_tso500
Symlink the circos script in the conda env to the tools directory:
ln -s /path/to/conda/envs/hmf_tso500/bin/circos tools/circos
Circos was complaining about to many data points, so increase max_points_per_track
in /path/to/conda/envs/molpath_hmf/etc/housekeeping.conf
to some higher number, not sure if this is needed for every sample.
Maybe HMF can comment on this?
Because the names of the regions used in the TSO500 reference are still different (chr1
instead of 1
), just copy the fasta reference file from TSO500_RUO_LocalApp/resources/genomes/<genome>/genome.fa
and then rename the regions:
sed -i 's/>chr/>/g' references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta
And we will create the rest of the files from that:
# dict
picard CreateSequenceDictionary -R references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta -O references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.dict
# fasta.dict
cp references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.dict references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta.dict
# fasta.fai
samtools faidx references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta
# fasta.img
# not sure this is needed if starting from BAM
gatk BwaMemIndexImageCreator -I references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta -O references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta.img
# fasta.ann
# fasta.amb
# fasta.bwt
# fasta.pac
# fasta.sa
bwa index references/ref_genome/Homo_sapiens.GRCh37.GATK.illumina.fasta
Hope this helps someone :)
Hello,
We're trying to run the panel tumor pipeline described here on our Illumina TSO500 data.
But after setting everything up, Amber immediately starts throwing this error when trying to load the sample BAM:
Because the reference used in the official Illumina analysis pipeline is different than the one used in this pipeline, the names are different (which was easily fixed), but also the length of the chrM/MT contig is different.
We thought of a couple of solutions to this, simply aligning to the new reference (slow), removing the mitochondrial reads and modifying the header (could be great), or replacing the reference files in this pipeline (not sure if the pipeline will accept this?).
We've spent some time on trying out modifying the BAM file, but so far tools kept tripping over it and it's something that's usually not recommended...
Since the page specifically describes TSO500 as en example, we were wondering how this was done,
Platinum is mentioned on the page so we assume the FastQ were used as a starting point, skipping the described problems, but double checking if there isn't an existing other solution.
Thank you in advance, if you need any more info let us know.