Define test datasets for population genomics

muffato commented 5 months ago

We want multiple test datasets to implement and test the population genomics features. Otter will be one of them, as we can refer to a publication that studies the population structure. We need at least another one, preferably on a smaller genome so that we can test things quicker.

[x] Find all the otter data on ENA / NCBI
[x] Find smaller / more diverse datasets

muffato commented 5 months ago

@hangxue-wustl . Here are the data that I've found:

Otter Lutra lutra (2.4 Gbp)

Assembly: https://www.ebi.ac.uk/ena/browser/view/GCA_902655055.2
Resequencing data: https://www.ebi.ac.uk/ena/browser/view/PRJEB43634

Crab apple Malus sylvestris (640 Mbp)

Assembly: https://www.ebi.ac.uk/ena/browser/view/GCA_916048215
Resequencing data: https://www.ebi.ac.uk/ena/browser/view/PRJEB47302 (select instrument_platform == "ILLUMINA" and library_strategy == "WGS")

Fever fly Dilophus febrilis (305 Mbp)

Assembly: https://www.ebi.ac.uk/ena/browser/view/GCA_958336335.1
Resequencing data: https://www.ebi.ac.uk/ena/browser/view/PRJEB53144

Tooth fungi, Hericium

Resequencing data: https://www.ebi.ac.uk/ena/browser/view/PRJEB52686
Assembly: multiple ones, depending on the species, but not from us: https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=40459

In all cases, for the resequencing data, make sure you select:

Library Strategy = WGS
Library Source = GENOMIC
Library Selection = RANDOM
Instrument Platform = ILLULIMA

to exclude Hi-C, RNA-Seq, and the PacBio reads

I haven't checked if those data are already aligned or not.

hangxue-wustl commented 5 months ago

I want to first try running sarek with --step variant_calling, which skips the mapping steps. I am able to download the aligned .cram files for the Otter Lutra lutra. Based the cram header, they used reference file UR:/lustre/scratch120/npg_repository/references/Lutra_lutra/GCA_902655055.2/all/fasta/GCA_902655055.2_mLutLut1.2_genomic.fa. I am not sure which reference file it might be. It sounds like unmasked genome based on the file name.

The following fasta file Lutra_lutra-GCA_902655055.2-chromosomes.tsv.gz is listed in https://ftp.ensembl.org/pub/rapid-release/species/Lutra_lutra/GCA_902655055.2/ensembl/genome. The following fasta file GCA_902655055.2_mLutLut1.2_genomic.fna.gz is listed in https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/655/055/GCA_902655055.2_mLutLut1.2/ The following fasta file GCA_902655055.2.fasta is downloadable from https://www.ebi.ac.uk/ena/browser/view/GCA_902655055.2 (All Seq FASTA)

muffato commented 5 months ago

Hi @hangxue-wustl . Those alignments were done by our Core Sequencing Facilities. It's a standard process that we can request if the species already has a reference genome assembly. I've checked the methods they use and they don't explicitly say whether the assembly is taken masked or unmasked but for human they say they use https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/ which are unmasked. On the other hand the filename is the same as the NCBI FTP https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/655/055/GCA_902655055.2_mLutLut1.2/GCA_902655055.2_mLutLut1.2_genomic.fna.gz where the assemblies are actually masked (with windowmasker). 🤷🏼

Our readmapping pipeline uses the unmasked genome to get all the possible hits, but then the alignments go through samtools view with these options https://github.com/sanger-tol/variantcalling/blob/main/conf/modules.config#L86

to remove non-primary hits. Faculty here told me that we should only be using the primary alignments for variant-calling.

sanger-tol / variantcalling