samtools / htslib

C library for high-throughput sequencing data formats
Other
785 stars 447 forks source link

Failed to populate reference for id #1627

Closed egnarora closed 1 year ago

egnarora commented 1 year ago

I am trying to index cram files which i downloaded from ebi database. samtools index xxxx.cram but i am getting the same error as issue #1069 i try to use his solution to solve this probleme but i cant index the cram either. i want to know which steps i forgot to set . and how can i index these cram files correctly.

egnarora commented 1 year ago

the error showed ailed to populate reference for id 1799 Failed to populate reference for id 1800 Failed to populate reference for id 1801 Failed to populate reference for id 1802 Failed to populate reference for id 1803 Failed to populate reference for id 1804 Failed to populate reference for id 1805 Failed to populate reference for id 1806 Failed to populate reference for id 1807 Failed to populate reference for id 1808 Failed to populate reference for id 1809 Failed to populate reference for id 1810 Failed to populate reference for id 1810

jkbonfield commented 1 year ago

This means you don't have a REF_PATH set that can find the CRAM references. Take a look at the environment variables listed in the man page here: https://www.htslib.org/doc/samtools.html#ENVIRONMENT_VARIABLES

Do you have the appropriate fasta file? Have you ran the seq_cache_populate.pl script on it as listed in the linked issue?

Ideally every command should be able to use --reference ref.fa too, but it seems to be missing from index. I'm confused however as to why index needs the reference. I've tried it here and I cannot reproduce that. There shouldn't be a need for it.

egnarora commented 1 year ago

I have run the seq_cache_populate.pl script .and i used the GRCh38.fa as the reference fasta . the command as below $/home/sun/Disk16T1/cza/HGDPcram$ ../../../../cbian/tools/samtools-1.17/misc/seq_cache_populate.pl -root=ref hg38.fa Reading hg38.fa and i got "Use environment REF_CACHE=ref/%2s/%2s/%s for accessing these files. See also https://www.htslib.org/workflow/#the-ref_path-and-ref_cache for further information" output,then i $export REF_CACHE=ref/%2s/%2s/%s $samtools index xxxx.cram and it came out "failed to populate reference for id etc"at all is there something wrong for my commands?

jkbonfield commented 1 year ago

You said this CRAM was downloaded from the EBI. Was that ENA or EGA? If it's public data we could check it.

I suspect it's got other references in there. Can you do samtools view on it? What about view -T GRCh38.fa?

I wonder if this is the exact same reference to the one used in the file. The sam headers should give you a clue. Indeed you can do e.g. this to see the 1800th reference name, and then check if that's what you expect and whether it exists with the same md5sum in the fasta file.

samtools view -H xxxx.cram | egrep '^@SQ' | head -1800 | tail -1
egnarora commented 1 year ago

I will give u an sample links"ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/HGDP/data/Uygur/HGDP01305/alignment/HGDP01305.alt_bwamem_GRCh38DH.20181023.Uygur.cram" i checked one of my cram files $samtools view -H HGDP01309.cram | egrep '^@SQ' | head -1800 | tail -1 and it shows @SQ SN:chrUn_JTFH01000957v1_decoy LN:1003 UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa AS:GRCh38 M5:3bf126803f3335d9227e753cb43060dc SP:Human maybe i used wrong reference ? i download "hg38.fa" from uscs, should i use the same versions as GRCh38.fa?

jkbonfield commented 1 year ago

Yes, this is the incorrect reference for this file.

I'd like to think the various organisations learn from the mistakes of hg19 vs GRCh37 which were different in many ways, but apparently not. Looking at the UCSC browser they refer to GCA_000001405.29 as their hg38 reference, and that URL claims it has 996 contigs and 434 patches. The .fai fasta index for my local copy of "GRCh38_full_analysis_set_plus_decoy_hla" has 3366 records.

Your file states where the reference came from, with a full URI, so it's best to populate your MD5 cache from that.

In general, the way this is meant to work is that due to so many variations in what contigs are in which reference, or even disagreements on the most basic of things such as "10" vs "chr10", the sequence is identified by it's md5 checksum. This is then used to fetch the sequence - ideally from a local cache (what seq_cache_populate generates), or failing that directly from the EBI reference server. This is enabled by default, but the fact it's not working is probably a firewall blocking direct internet access.

egnarora commented 1 year ago

Thanks a lot for your answer. so i download the new reference fasta file . and if i run the seq_cache_populate.pl script with "GRCh38_full_analysis_set_plus_decoy_hla" it will be correct or not? or should i set the other options to make the reference right. yours sincerely

egnarora commented 1 year ago

so i noticed that there were indexed crai file where my file states came from , so i think these indexed file i can use it directly without any manage. is that correct? i have built a index for hg38 , should i built another index for GRCh38_full...?

jkbonfield commented 1 year ago

If you have a file remotely, such as ftp://some_host/some_file.cram, then you should be able to view that directly within samtools and it'd download and use the reference associated with it too. If you have no internet access (which I suspect given the apparent inability to connect to the EBI reference server) then you'd have to manually download the .cram.crai file yourself and place it adjacent to the .cram.

I do not understand however why a reference is needed for indexing a cram, and I couldn't reproduce this problem locally. However avoiding reindexing is easy obviously.

For using samtools view and other similar subcommands you have two choices. Either specify the reference with eg --reference ref.fa or sometimes view -T ref.fa, or setting REF_CACHE environment variable as you've done above and using seq_cache_populate.pl. You don't need to remove other existing content there - it's simply additive and will accumulate a cache of reference sequences.