sanger-tol / variantcalling

This Nextflow DSL2 pipeline calls variants on long read alignment. It is run after sanger-tol/readmapping in the Sanger ToL production suite but with options to run on unaligned reads.
https://pipelines.tol.sanger.ac.uk/variantcalling
MIT License
3 stars 0 forks source link

Define test datasets for population genomics #81

Open muffato opened 2 weeks ago

muffato commented 2 weeks ago

We want multiple test datasets to implement and test the population genomics features. Otter will be one of them, as we can refer to a publication that studies the population structure. We need at least another one, preferably on a smaller genome so that we can test things quicker.

muffato commented 1 week ago

@hangxue-wustl . Here are the data that I've found:

Otter Lutra lutra

Crab apple Malus sylvestris

Fever fly Dilophus febrilis

Tooth fungi, Hericium


In all cases, for the resequencing data, make sure you select:

to exclude Hi-C, RNA-Seq, and the PacBio reads

I haven't checked if those data are already aligned or not.

hangxue-wustl commented 1 week ago

I want to first try running sarek with --step variant_calling, which skips the mapping steps. I am able to download the aligned .cram files for the Otter Lutra lutra. Based the cram header, they used reference file UR:/lustre/scratch120/npg_repository/references/Lutra_lutra/GCA_902655055.2/all/fasta/GCA_902655055.2_mLutLut1.2_genomic.fa. I am not sure which reference file it might be. It sounds like unmasked genome based on the file name.

The following fasta file Lutra_lutra-GCA_902655055.2-chromosomes.tsv.gz is listed in https://ftp.ensembl.org/pub/rapid-release/species/Lutra_lutra/GCA_902655055.2/ensembl/genome. The following fasta file GCA_902655055.2_mLutLut1.2_genomic.fna.gz is listed in https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/655/055/GCA_902655055.2_mLutLut1.2/ The following fasta file GCA_902655055.2.fasta is downloadable from https://www.ebi.ac.uk/ena/browser/view/GCA_902655055.2 (All Seq FASTA)

muffato commented 1 week ago

Hi @hangxue-wustl . Those alignments were done by our Core Sequencing Facilities. It's a standard process that we can request if the species already has a reference genome assembly. I've checked the methods they use and they don't explicitly say whether the assembly is taken masked or unmasked but for human they say they use https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/ which are unmasked. On the other hand the filename is the same as the NCBI FTP https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/655/055/GCA_902655055.2_mLutLut1.2/GCA_902655055.2_mLutLut1.2_genomic.fna.gz where the assemblies are actually masked (with windowmasker). 🤷🏼

Our readmapping pipeline uses the unmasked genome to get all the possible hits, but then the alignments go through samtools view with these options https://github.com/sanger-tol/variantcalling/blob/main/conf/modules.config#L86

to remove non-primary hits. Faculty here told me that we should only be using the primary alignments for variant-calling.