Closed muffato closed 2 months ago
@hangxue-wustl . Here are the data that I've found:
In all cases, for the resequencing data, make sure you select:
to exclude Hi-C, RNA-Seq, and the PacBio reads
I haven't checked if those data are already aligned or not.
I want to first try running sarek with --step variant_calling, which skips the mapping steps. I am able to download the aligned .cram files for the Otter Lutra lutra. Based the cram header, they used reference file UR:/lustre/scratch120/npg_repository/references/Lutra_lutra/GCA_902655055.2/all/fasta/GCA_902655055.2_mLutLut1.2_genomic.fa. I am not sure which reference file it might be. It sounds like unmasked genome based on the file name.
The following fasta file Lutra_lutra-GCA_902655055.2-chromosomes.tsv.gz is listed in https://ftp.ensembl.org/pub/rapid-release/species/Lutra_lutra/GCA_902655055.2/ensembl/genome. The following fasta file GCA_902655055.2_mLutLut1.2_genomic.fna.gz is listed in https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/655/055/GCA_902655055.2_mLutLut1.2/ The following fasta file GCA_902655055.2.fasta is downloadable from https://www.ebi.ac.uk/ena/browser/view/GCA_902655055.2 (All Seq FASTA)
Hi @hangxue-wustl . Those alignments were done by our Core Sequencing Facilities. It's a standard process that we can request if the species already has a reference genome assembly. I've checked the methods they use and they don't explicitly say whether the assembly is taken masked or unmasked but for human they say they use https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/ which are unmasked. On the other hand the filename is the same as the NCBI FTP https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/655/055/GCA_902655055.2_mLutLut1.2/GCA_902655055.2_mLutLut1.2_genomic.fna.gz where the assemblies are actually masked (with windowmasker). 🤷🏼
Our readmapping pipeline uses the unmasked genome to get all the possible hits, but then the alignments go through samtools view
with these options https://github.com/sanger-tol/variantcalling/blob/main/conf/modules.config#L86
to remove non-primary hits. Faculty here told me that we should only be using the primary alignments for variant-calling.
We want multiple test datasets to implement and test the population genomics features. Otter will be one of them, as we can refer to a publication that studies the population structure. We need at least another one, preferably on a smaller genome so that we can test things quicker.