igvteam / igv-reports

Python application to generate self-contained pages embedding IGV visualizations, with no dependency on original input files.
MIT License
350 stars 52 forks source link

Input/output error when creating IGV report #53

Closed fgvieira closed 3 years ago

fgvieira commented 3 years ago

Dear all,

I am trying to create an IGV report:

create_report --flanking 100 --info-columns DP AF ANN --sample-columns GT DP AD GQ --standalone --output out.html variants.annot.vcf.gz reference_genome.fasta --tracks sample.cram

But I am getting the error below:

[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[W::find_file_url] Failed to read reference "https://www.ebi.ac.uk/ena/cram/md5/97a317c689cbdd7e92a5c159acd290d2": Input/output error
[...]

What does this error mean? If I try access the URL, it works fine. And why is IGV trying to access ENA if I provide the reference genome?

thanks,

jrobinso commented 3 years ago

@fgvieira I actually do not know, I suspect pysam is trying to access the file based on information in the header. If so this is out of my control. Do you have, or can you create, a "bam" file version of the cram to see if the error persists?

jrobinso commented 3 years ago

I found this in the samtools reference (pysam, used by igv-reports, wraps samtools)

When reading a CRAM the @SQ headers are interrogated to identify the reference sequence MD5sum (M5: tag) and the local reference sequence filename (UR: tag). Note that http:// and ftp:// based URLs in the UR: field are not used, but local fasta filenames (with or without file://) can be used.

To create a CRAM the @SQ headers will also be read to identify the reference sequences, but M5: and UR: tags may not be present. In this case the -T and -t options of samtools view may be used to specify the fasta or fasta.fai filenames respectively (provided the .fasta.fai file is also backed up by a .fasta file).

I cannot reproduce the issue with my test CRAM files, but I suspect there is some specific combination of headers that cause samtools to try to compute a MD5Sum from the reference specified in the header. I think the fix will be to pass the reference specified in igv-reports to pysam with the "-T" argument. I am actually on vacation now until July 12 and will not try a release until I return, but I have pushed this (potential) fix. If you are able to build from source and try it please report back here. Feel free to ping this issue after July 12 as a reminder to release.

fgvieira commented 3 years ago

I tried the latest commit, but got an error:

Traceback (most recent call last):
  File "bin/create_report", line 8, in <module>
    sys.exit(main())
  File "lib/python3.9/site-packages/igv_reports/report.py", line 271, in main
    create_report(args)
  File "lib/python3.9/site-packages/igv_reports/report.py", line 51, in create_report
    reader = utils.getreader(track, None, args.fasta)
  File "lib/python3.9/site-packages/igv_reports/utils.py", line 11, in getreader
    return bam.BamReader(path)
TypeError: __init__() missing 1 required positional argument: 'fasta'

I tried adding the --fasta command line option, but it is not recognized.

jrobinso commented 3 years ago

OK I'll have to look at this next week, I'm traveling now. There is no. "--fasta" option in igv-reports, that is a pysam option, sorry if I wasn't clear.

Many people use this with CRAM files, I don't know what it is about yours that is triggering this. However, the error is coming from pysam, a python package. I suggest installing that package and trying to query the file for alignments, knowing if that is successful or not would be good information.

fgvieira commented 3 years ago

Looking a bit more into the code, I think the issue might be here: https://github.com/igvteam/igv-reports/blob/a0c558a8a16d7168ca550ff0d1816c000af3f8ff/igv_reports/feature.py#L246-L247

jrobinso commented 3 years ago

@fgvieira Sorry for the delay, could you pull and try again when you have a chance?

You were on the right track, but its a little more subtle. For CRAM format we need to distinguish between input and output formats (CRAM is output as BAM in the report).

fgvieira commented 3 years ago

Seems to be working now. :+1: thanks!