jkbonfield / io_lib

Staden Package "io_lib" (sometimes referred to as libstaden-read by distributions). This contains code for reading and writing a variety of Bioinformatics / DNA Sequence formats.

Use of REF_PATH #10

Closed keiranmraine closed 6 years ago

keiranmraine commented 6 years ago

Hi,

I'm indirectly using io_lib via biobambam2 so this is primarily an attempt to isolate where the problem may lie. My understanding is that CRAM conversion is handled by io_lib.

I'm finding that using REF_PATH isn't working as expected.

When I run bamtofastq (as we need to split by readgroup too) the job fails:

$ REF_PATH='URL=http:://www.ebi.ac.uk/ena/cram/md5/%s' REF_CACHE=$PWD/wibble/to_split/hts-ref-cache/%2s/%2s/%s bamtofastq gz=1 exclude=SECONDARY,SUPPLEMENTARY tryoq=1 outputperreadgroup=1 outputperreadgroupprefix=colo-829 outputperreadgroupsuffixF=_1.fq.gz outputperreadgroupsuffixF2=_2.fq.gz outputperreadgroupsuffixO=_o1.fq.gz outputperreadgroupsuffixO2=_o2.fq.gz outputperreadgroupsuffixS=_s.fq.gz inputformat=cram filename=wibble/to_split/colo-829.cram outputdir=wibble/
HTTP/1.1 301 Moved Permanently
Content-Type: text/html
Date: Wed, 10 Oct 2018 11:50:03 GMT
Location: https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128
Connection: Keep-Alive
Content-Length: 0

HTTP/1.1 301 Moved Permanently
Content-Type: text/html
Date: Wed, 10 Oct 2018 12:05:21 GMT
Location: https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128.gz
Connection: Keep-Alive
Content-Length: 0

... for bz2, sz, Z, bz2 (oddly a second time) ...

Failed to populate reference for id 0
Unable to fetch reference #0 13042..801529
Failure to decode slice
ScramDecoder::readAlignment(): failed to read alignment without reaching EOF

/home/kr2/local/biobambam2/bin/../lib/libmaus2.so.2(libmaus2::util::StackTrace::StackTrace()+0x4c)[0x7f6466520e3c]
bamtofastq(libmaus2::exception::LibMausException::LibMausException()+0x20)[0x41a5d0]
bamtofastq()[0x441728]
bamtofastq()[0x46ef1b]
bamtofastq()[0x47652b]
bamtofastq()[0x4160f6]
bamtofastq()[0x416b3e]
bamtofastq()[0x411f2a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f6462e1eb97]
bamtofastq()[0x412732]

If I first run samtools view, thus pre-populating the REF_CACHE then bamtofastq successfully completes:

$ REF_PATH='http://www.ebi.ac.uk/ena/cram/md5/%s' REF_CACHE=$PWD/wibble/to_split/hts-ref-cache/%2s/%2s/%s samtools view wibble/to_split/colo-829.cram | wc -l -l
499500
$ REF_PATH='URL=http:://www.ebi.ac.uk/ena/cram/md5/%s' REF_CACHE=$PWD/wibble/to_split/hts-ref-cache/%2s/%2s/%s bamtofastq gz=1 exclude=SECONDARY,SUPPLEMENTARY tryoq=1 outputperreadgroup=1 outputperreadgroupprefix=colo-829 outputperreadgroupsuffixF=_1.fq.gz outputperreadgroupsuffixF2=_2.fq.gz outputperreadgroupsuffixO=_o1.fq.gz outputperreadgroupsuffixO2=_o2.fq.gz outputperreadgroupsuffixS=_s.fq.gz inputformat=cram filename=wibble/to_split/colo-829.cram outputdir=wibble/
[V] 488944
[V] MemUsage(size=505.844,rss=201.539,peak=637.629) wall clock time 19:05660900
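For anyone reading along, here is a minimal sketch of how a `REF_CACHE` template such as `%2s/%2s/%s` maps an MD5 onto a cache file. It follows the behaviour documented for htslib/samtools, where `%Ns` consumes the next N characters of the checksum and `%s` consumes the rest; that io_lib expands the template in the same way is an assumption. The function name `expand_cache_path` is hypothetical and is not part of either library.

```python
import re


def expand_cache_path(template, md5):
    """Expand %Ns / %s directives left to right, consuming the MD5 string."""
    remaining = md5

    def substitute(match):
        nonlocal remaining
        width = match.group(1)
        if width:
            # %Ns: take the next N characters of the checksum
            n = int(width)
            piece, remaining = remaining[:n], remaining[n:]
        else:
            # %s: take whatever is left of the checksum
            piece, remaining = remaining, ""
        return piece

    return re.sub(r"%(\d*)s", substitute, template)


# MD5 taken from the 'Location' header in the log above
md5 = "1b22b98cdeb4a9304cb5d48026a85128"
print(expand_cache_path("hts-ref-cache/%2s/%2s/%s", md5))
# prints hts-ref-cache/1b/22/b98cdeb4a9304cb5d48026a85128
```

Under this scheme, a file written by one tool into the cache directory is found by another tool only if both use the same template, which would be consistent with the samtools-populated cache being picked up by bamtofastq here.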

Any thoughts?

biobambam2 2.0.86, not sure which io_lib is compiled into the bundle.

jkbonfield commented 6 years ago

On Wed, Oct 10, 2018 at 12:15:05PM +0000, Keiran Raine wrote:

> I'm finding that using REF_PATH isn't working as expected.

I think the problem is it is not honouring 301 redirect codes. Perhaps this can be enabled in the code:

https://curl.haxx.se/libcurl/c/CURLOPT_FOLLOWLOCATION.html

$ REF_PATH='URL=http:://https://urldefense.proofpoint.com/v2/url?u=http-3A__www.ebi.ac.uk_ena_cram_md5_-25s&d=DwICaQ&c=D7ByGjS34AllFgecYw0iC6Zq7qlm8uclZFI0SqQnqBo&r=wodoR_G062E4YLZ-xu5t6g&m=-Ub6gf1h1Ts4cYDiDHrHoDQfWphQW7lS728AtZKJAz8&s=-y-F2smd1D4fhfUqRkPar_Y__e8ZH7DvKtL6ZnSQhRI&e=' REF_CACHE=$PWD/wibble/to_split/hts-ref-cache/%2s/%2s/%s bamtofastq gz=1 exclude=SECONDARY,SUPPLEMENTARY tryoq=1 outputperreadgroup=1 outputperreadgroupprefix=colo-829 outputperreadgroupsuffixF=_1.fq.gz outputperreadgroupsuffixF2=_2.fq.gz outputperreadgroupsuffixO=_o1.fq.gz outputperreadgroupsuffixO2=_o2.fq.gz outputperreadgroupsuffixS=_s.fq.gz inputformat=cram filename=wibble/to_split/colo-829.cram outputdir=wibble/

I'm confused why you have http:://https:// in here? (And blergh! ^&%!-off proofpoint I want to see the proper URL again).

Io_lib should now accept a single colon, as in URL=http://www.ebi.ac.uk/(etc). Although colon is the path separator, it specifically checks for a preceding http or ftp, and hopefully handles port numbers similarly. However, looking at the code I see I forgot to add https! Sigh. (It's in the htslib copy.)

Note you can use '|' before a search name to avoid the repeated lookups with different file extensions. Eg:

REF_PATH='|URL=https:://www.ebi.ac.uk/ena/cram/md5/%s'

Try explicitly using https instead of http to avoid the redirect.
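To illustrate the colon handling described above, here is a rough sketch of a REF_PATH splitter. It is not io_lib's or htslib's actual code: the function name `split_ref_path` and its `protocols` parameter are hypothetical, and the rules are simply those mentioned in this thread (`:` separates elements, `::` is a literal colon, a colon after a known scheme is kept, and a leading `|` disables the extra file-extension lookups). Port numbers are deliberately not handled.

```python
def split_ref_path(ref_path, protocols=("http", "https", "ftp")):
    """Split a REF_PATH-style string into (element, try_suffixes) pairs."""
    elements, current, i = [], "", 0
    while i < len(ref_path):
        if ref_path[i] != ":":
            current += ref_path[i]
            i += 1
        elif ref_path.startswith("::", i):
            # '::' is an escaped, literal colon
            current += ":"
            i += 2
        elif (ref_path.startswith("://", i)
              and current.split("=")[-1].lstrip("|") in protocols):
            # the colon belongs to a recognised URL scheme, so keep it
            current += ":"
            i += 1
        else:
            # plain separator: finish the current element
            elements.append(current)
            current = ""
            i += 1
    elements.append(current)

    result = []
    for element in elements:
        if not element:
            continue
        if element.startswith("|"):
            # leading '|' means: do not retry with .gz, .bz2, ... suffixes
            result.append((element[1:], False))
        else:
            result.append((element, True))
    return result


print(split_ref_path("|URL=https:://www.ebi.ac.uk/ena/cram/md5/%s"))
# the same single-colon URL when https is missing from the scheme list
print(split_ref_path("URL=https://www.ebi.ac.uk/ena/cram/md5/%s",
                     protocols=("http", "ftp")))
```

The second call shows what a missing `https` entry would do in this sketch: the URL is cut after the scheme, leaving `URL=https` and `//www.ebi.ac.uk/...` as two separate elements, which is why the `::` form sidesteps the problem.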

> ... for bz2, sz, Z, bz2 (oddly a second time) ...

Good spot. I think it's something to do with the magic number detection order too, but it's been years since I wrote that code. This is actually meant for finding compressed sequence chromatograms. :-)

-- James Bonfield (jkb@sanger.ac.uk) The Sanger Institute, Hinxton, Cambs, CB10 1SA


keiranmraine commented 6 years ago

Thanks James, that makes total sense. I'd have spotted that if I hadn't copied the complete URL from the forwarding 'Location' in the error when testing with curl.

Regarding :: in URLs, I was following a subsection of the biobambam2 README which indicates a double colon. I'd tried both with and without. Now that I'm using https, I can see that I have to use ::, otherwise I get:

CURL ERROR: Couldn't resolve host 'https'

Sadly using https I now get:

$ REF_PATH='URL=https:://www.ebi.ac.uk/ena/cram/md5/%s' ...
CURL ERROR: server certificate verification failed. CAfile: none CRLfile: none
...

Running a direct curl, I see these getting set:

$ curl -vi https://www.ebi.ac.uk/ena/cram/md5/1b22b98cdeb4a9304cb5d48026a85128 > /dev/null
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs

Is this something you would expect?

(I hate Proofpoint too, especially when it fails to reconstruct the URL)

jkbonfield commented 6 years ago

There were a number of problems to fix.

  1. Double colon was needed for https, but not http and ftp. I simply hadn't updated the exceptions.

  2. It didn't honour redirect (301). It now does.

  3. https requires a user-agent, or at least our proxy does.
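As a rough illustration of points 2 and 3 (not the io_lib change itself, which is C code built on libcurl), here is a Python sketch of a reference fetch that follows redirects and sends a user-agent. In libcurl the corresponding options would be `CURLOPT_FOLLOWLOCATION` and `CURLOPT_USERAGENT`; whether the fix uses exactly those is an assumption. The function names and the user-agent string below are hypothetical.

```python
import urllib.request

ENA_MD5_URL = "https://www.ebi.ac.uk/ena/cram/md5/"


def build_reference_request(md5, base=ENA_MD5_URL,
                            user_agent="ref-fetch-sketch/0.1"):
    """Build a request for one reference MD5 with an explicit User-Agent."""
    return urllib.request.Request(base + md5,
                                  headers={"User-Agent": user_agent})


def fetch_reference(md5):
    """Download the reference sequence for an MD5 (needs network access)."""
    # urlopen's default opener installs HTTPRedirectHandler, so a
    # 301 Moved Permanently response is followed automatically.
    request = build_reference_request(md5)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


request = build_reference_request("1b22b98cdeb4a9304cb5d48026a85128")
print(request.full_url)
```

Calling `fetch_reference("1b22b98cdeb4a9304cb5d48026a85128")` would perform the actual download; it is left out of the runnable part so the sketch works offline.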

This seems to work in io_lib itself now. I tested it (with and without proxy settings) via:

http_proxy= https_proxy= REF_PATH='|https://www.ebi.ac.uk/ena/cram/md5/%s' REF_CACHE=/ scramble -H ~/scratch/data/9827_2#49.1m.cram