samtools / htslib

C library for high-throughput sequencing data formats
Other
799 stars 445 forks source link

Trouble converting .cram to .fq #1515

Closed 0106WeiWeiDeng closed 1 year ago

0106WeiWeiDeng commented 1 year ago

hello,

I download some .cram files from ENA (https://www.ebi.ac.uk/ena/browser/view/PRJEB42969?show=reads), and the human ref genome file(ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/ phase2_reference_assembly_sequence/hs37d5.fa.gz). I first want to convert .cram file to .bam file, and then .fq file. I got errors when I running the following cmd:

samtools view -b -T ~/hs37d5.fa -@ 8 -o RADS27LK_DNA.bam ../bulk_DNA_WES/RADS27LK_DNA.cram

and error info is

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:822981-1026969

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:12979-823012

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:1026877-1269314

[E::cram_next_slice] Slice decode failure samtools view: error reading file "../bulk_DNA_WES/RADS27LK_DNA.cram" [W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:1269215-1452587

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:1452511-1653336

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:1653238-1902042

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:2487960-3124516

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:3124629-3552717

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:1901952-2235974

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:2235878-2488058

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:3552623-3750770

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:3750688-4254076

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:5236370-5748914

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:5748849-6133821

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:4253995-4803687

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:6579376-6702774

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:4803613-5236461

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:6133722-6311715

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:6311616-6579475

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:6702723-6961376

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:7822405-7902854

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:7440071-7822462

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:6961374-7440105

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:7902755-8057250

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:8370582-8617569

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:8057335-8370672

[W::find_file_url] Failed to open reference "https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128": Protocol not supported [E::fai_build3_core] Failed to open the file /data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa [E::refs_load_fai] Failed to open reference file '/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa' [E::cram_get_ref] Failed to populate reference for id 0 [E::cram_decode_slice] Unable to fetch reference #0:8617470-8853078

samtools version info:

samtools 1.13 Using htslib 1.13+ds Copyright (C) 2021 Genome Research Ltd.

Samtools compilation details: Features: build=configure curses=yes CC: gcc CPPFLAGS: -frelease -Wdate-time -D_FORTIFY_SOURCE=2 CFLAGS: -g -O2 -ffile-prefix-map=▒BUILDPATH▒=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security LDFLAGS: -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now HTSDIR: LIBS: CURSES_LIB: -lcurses

HTSlib compilation details: Features: build=configure plugins=yes, plugin-path=/usr/local/lib/htslib:/usr/local/libexec/htslib:: libcurl=yes S3=yes GCS=yes libdeflate=yes lzma=yes bzip2=yes htscodecs=1.1.1 CC: gcc CPPFLAGS: -I. -DSAMTOOLS=1 -Wdate-time -D_FORTIFY_SOURCE=2 CFLAGS: -g -O2 -ffile-prefix-map=/build/htslib-TQtOKr/htslib-1.13+ds=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -flto -fvisibility=hidden -flto -fvisibility=hidden LDFLAGS: -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,-flto -fvisibility=hidden -Wl,-flto -fvisibility=hidden

HTSlib URL scheme handlers present: built-in: preload, data, file crypt4gh-needed: crypt4gh mem: mem

System info: Ubuntu 22.04 LTS

jkbonfield commented 1 year ago

The cause of your error appears to be a build issue. I don't recognise the -ffile-prefix-map options, so I'm guessing this is some binary install from a distribution somewhere rather than a manual user driven build. It looks like htslib has been compiled without finding a working copy of curl, so it has no access to fetching references via https. That'll be where the "Protocol not supported" line comes from.

A standard htslib build would claim:

HTSlib URL scheme handlers present:
    built-in:    preload, data, file
    S3 Multipart Upload:     s3w, s3w+https, s3w+http
    Amazon S3:   s3+https, s3+http, s3
    Google Cloud Storage:    gs+http, gs+https, gs
    libcurl:     imaps, pop3, http, smb, gopher, ftps, imap, smtp, smtps, rtsp, ftp, telnet, rtmp, ldap, https, ldaps, tftp, pop3s, smbs, dict
    crypt4gh-needed:     crypt4gh
    mem:     mem

Meanwhile, this looks like GRCh37 reference, so you can probably download a copy to use locally (eg one of https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/). Edit: sorry I see you already tried that. I'm not sure why it can't read it. Maybe you need to run samtools faidx on it first, although I thought htslib did that automatically.

0106WeiWeiDeng commented 1 year ago

Thanks for your quick reply! I installed samtools using Ubuntu's apt, I will try it again by building samtools from source.

jkbonfield commented 1 year ago

Discussing this over our coffee break, we realised the most likely cause is you have a HTS_PATH environment variable pointing to /usr/local/lib/htslib:/usr/local/libexec/htslib::. This ought to permit it to also search the default system directories, but it doesn't appear to be doing so.

If you unset HTS_PATH and run samtools version again does it then start listing the curl (etc) plugins?

daviesrob commented 1 year ago

The problem finding plug-ins looks like a Debian/Ubuntu packaging problem. It's surprising that no-one has reported it before.

Trying on a fresh Ubuntu jammy install I got:

ubuntu@jammy-test:~$ samtools version
samtools 1.13
Using htslib 1.13+ds
Copyright (C) 2021 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             gcc
    CPPFLAGS:       -frelease  -Wdate-time -D_FORTIFY_SOURCE=2
    CFLAGS:         -g -O2 -ffile-prefix-map=?BUILDPATH?=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security
    LDFLAGS:        -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now
    HTSDIR:         
    LIBS:           
    CURSES_LIB:     -lcurses

HTSlib compilation details:
    Features:       build=configure plugins=yes, plugin-path=/usr/local/lib/htslib:/usr/local/libexec/htslib:: libcurl=yes S3=yes GCS=yes libdeflate=yes lzma=yes bzip2=yes htscodecs=1.1.1
    CC:             gcc
    CPPFLAGS:       -I. -DSAMTOOLS=1 -Wdate-time -D_FORTIFY_SOURCE=2
    CFLAGS:         -g -O2 -ffile-prefix-map=/build/htslib-TQtOKr/htslib-1.13+ds=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -flto -fvisibility=hidden -flto -fvisibility=hidden
    LDFLAGS:        -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,-flto -fvisibility=hidden -Wl,-flto -fvisibility=hidden

HTSlib URL scheme handlers present:
    built-in:    preload, data, file
    crypt4gh-needed:     crypt4gh
    mem:     mem

Using HTS_PATH=/usr/lib/x86_64-linux-gnu/htslib to show where the plug-ins are fixes it:

ubuntu@jammy-test:~$ HTS_PATH=/usr/lib/x86_64-linux-gnu/htslib samtools version
samtools 1.13
Using htslib 1.13+ds
Copyright (C) 2021 Genome Research Ltd.

Samtools compilation details:
    Features:       build=configure curses=yes 
    CC:             gcc
    CPPFLAGS:       -frelease  -Wdate-time -D_FORTIFY_SOURCE=2
    CFLAGS:         -g -O2 -ffile-prefix-map=?BUILDPATH?=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security
    LDFLAGS:        -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now
    HTSDIR:         
    LIBS:           
    CURSES_LIB:     -lcurses

HTSlib compilation details:
    Features:       build=configure plugins=yes, plugin-path=/usr/lib/x86_64-linux-gnu/htslib: libcurl=yes S3=yes GCS=yes libdeflate=yes lzma=yes bzip2=yes htscodecs=1.1.1
    CC:             gcc
    CPPFLAGS:       -I. -DSAMTOOLS=1 -Wdate-time -D_FORTIFY_SOURCE=2
    CFLAGS:         -g -O2 -ffile-prefix-map=/build/htslib-TQtOKr/htslib-1.13+ds=. -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -flto -fvisibility=hidden -flto -fvisibility=hidden
    LDFLAGS:        -Wl,-Bsymbolic-functions -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,-flto -fvisibility=hidden -Wl,-flto -fvisibility=hidden

HTSlib URL scheme handlers present:
    built-in:    preload, data, file
    S3 Multipart Upload:     s3w, s3w+https, s3w+http
    Google Cloud Storage:    gs+http, gs+https, gs
    Amazon S3:   s3+https, s3+http, s3
    libcurl:     imaps, pop3, gophers, http, smb, gopher, sftp, ftps, imap, smtp, smtps, rtsp, scp, ftp, telnet, mqtt, rtmp, ldap, https, ldaps, smbs, tftp, pop3s, dict
    crypt4gh-needed:     crypt4gh
    mem:     mem

It looks like the same problem may exist in the current Debian htslib package so I guess we should report the problem there.

jkbonfield commented 1 year ago

The build instructions in the Debian repo are here: https://salsa.debian.org/med-team/htslib/-/blob/master/debian/rules

The dh_auto_configure -- --enable-libcurl --enable-gcs --enable-s3 --enable-plugins --with-plugin-dir='$(libdir)/htslib' --with-external-htscodecs --with-plugin-path='/usr/local/lib/htslib:/usr/local/libexec/htslib:$(plugindir)' line indicates that likely $(plugindir) is unset.

They should just replace it with $(libdir)/htslib.

daviesrob commented 1 year ago

$(plugindir) is a Makefile variable that should have been replaced with the value given for --with-plugin-dir. It works if you do ./configure --with-plugin-path='/usr/local/lib/htslib:/usr/local/libexec/htslib:$(plugindir)', but Debian aren't doing that directly. I suspect a loss of quoting between dh_auto_configure and configure.

0106WeiWeiDeng commented 1 year ago

I build samtools 1.16.1 from source and run the cmd: samtools view -b -T ~/hs37d5.fa -o RADS27LK_DNA.bam ../bulk_DNA_WES/RADS27LK_DNA.cram, now it did not print the errors, but it stays at 'runing' for a long time(12+ hours now ). I use htop to view the state of samtools cmd, and found that it stays in s. It seems like samtools get stuck at some point. The samtools cmd output info is :

[W::cram_populate_ref] Creating reference cache directory /home/deng/.cache/hts-ref
This may become large; see the samtools(1) manual page REF_CACHE discussion
jkbonfield commented 1 year ago

Firstly, the hang is probably simply a networking problem. Either local bandwidth is being throttled, making fetching the remote reference painfully slow, or the EBI are serving data up at an extremely slow rate. You'd probably need to use strace to really diagnose where it's spending the time.

As for why it's even going offsite to fetch a reference, this is sadly a classic issue of bad data provenance by the data submitters.

$ samtools view -H ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR528/ERR5285421/RADS27LK_DNA.cram | grep '@SQ' | head -1
@SQ SN:chr1 LN:249250621    M5:1b22b98cdeb4a9304cb5d48026a85128 UR:/data1/symec/RESOURCES/refseq/NCBI37_DECOY.fa

Note the chromosomes are named "chr1", not "1". Hence this is not the 1000 genomes hs37d5.fa file, and that is why samtools is attempting to fetch your reference from a remote resource (via the md5sum), because it cannot find it in the specified file. This is perhaps a situation we should report as a warning rather than simply trying harder to solve the mistake.

As for what the actual reference is, that's challenging. Seriously I wish the NCBI and ENA archives simply point blank refused to accept submissions that don't tie back to a known registered assembly file. It makes life so hard!

A bit of data sleuthing shows this has 86 references. The length of MT is 16569, not 16571 (and MT not M), so it's not hgc19 (as that was one difference between that and grch37 as I recall). Armed with that, I produced an md5sum of all md5sums and compared that to a bunch of known local references and it does indeed match my hs37d5.fa. It's just that someone for some baffling reason thought it'd be a good idea to rename all the chromosomes! (https://www.edvardmunch.org/the-scream.jsp)

So from here you have a couple of choices.

  1. Produce a new local copy of hs37d5.fa with "chr" added to all refs. This is probably the easiest strategy.
sed 's/^>/>chr/' hs37d5.fa > NCBI37_DECOY.fa
  1. Produce an MD5sum directory structure so CRAM can use that for direct mmaps of references and avoid parsing the fasta files (untested, but something like this):
$ samtools/misc/seq_cache_populate.pl -root ~/.cache/hts-ref hs37d5.fa
$ export REF_CACHE=~/.cache/hts-ref/%2s/%2s/%s

The REF_CACHE setting here should be the default one so isn't needed if you use that location, but if you take this approach you may wish to change directory to somewhere more appropriate. I chose that as it's simply what htslib is trying to automatically locally populate for you by downloading the references from the EBI.

jkbonfield commented 1 year ago

I can confirm the glacial rate is due to the EBI's reference server. It was estimating 30minutes to download Human chr1. :-(

I'd strongly recommend building your own local directory of pre-cached md5sums per sequence. Although it's more effort to maintain, it makes you immune to the vagarities of people who like to rename chromosomes for giggles, as the MD5sum becomes the access key instead, it affords deduplication when we get differing combinations of references with most of the major ones being the same (GRCh37 vs h37d5), and it's also faster as there's no fasta format parsing necessary.

0106WeiWeiDeng commented 1 year ago

Oh, I see, the real problem is the submiter did not use the standard human ref genome to generate cram files! "It make life so hard"! Thanks a lot for your help!