pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
779 stars 274 forks source link

handle expected [E::idx_find_and_load] without stopping execution #1214

Closed sroener closed 1 year ago

sroener commented 1 year ago

Hi,

I have a use case in which I create a bunch of temporary bam files in a multiprocessing context. After finishing the multiprocessing part, I aggregate the temporary files.

The catch is that I get a lot of Errors from htslib that are expected: [E::idx_find_and_load] because all the tempfiles are not indexed. This should be no issue, because I just read all the reads from these temp files in a final file and I made sure that everything is ordered by genomic coordinate.

From my understanding, if I would use try-except blocks, I would interrupt the aggregation of the tempfiles for catching the error.

My question is, whether there is a way to catch the Error without stopping the aggregation of tmpfiles.

I have seen the approach mentioned in #939 , but ideally would like to keep the message for a debug log.

jmarshall commented 1 year ago

This is a good reminder that the fix in #939 should be applied. But the message in that case is merely a warning that appears in passing and does not cause any failures. However in your case you seem to be saying that you would need try/except to catch some errors that you are actually encountering?

Please show us the relevant part of your code, and the errors and diagnostics you get.

sroener commented 1 year ago

@jmarshall thank you for the quick response.

Let me clear up the confusion a bit.

The code looks like this:

 with pysam.AlignmentFile(output_file, "wb", template=input_bam ,threads=out_threads) as of:
            for tmpfile in imap_res: # returns several thousand temporary files, which are not indexed
                with pysam.AlignmentFile(tmpfile, "rb", threads=in_threads) as file:
                    # read all reads from the temp file and write them to the final output file
                    for read in file.fetch(until_eof=True): 
                        of.write(read)

The Warning is: [E::idx_find_and_load] Could not retrieve index file for [...]. As you mentioned, it is not causing a failure, but as I aggregate over more than 2000 temporary files, I get the error message for each of these files. I could just ignore them, but I would rather handle them. The reasons for this are:

1) it blows up the log file 2) it makes it harder to follow what the script is doing and if there is something odd happening

Three solutions come to mind for handling this behavior:

1) Index all temporary bam files, creating many unnecessary files, just to get rid of the warnings and to delete them later. It would also create some computational overhead and I would like to avoid this. 2) Set verbosity of HTSlib to 0, effectively suppressing all errors and warnings. This would be the resource effective solution, but I feel like the design could cause some problems if an unexpected warning/error arises, without being noticed. 3) Catch the error and handle it based on some configuration. This is the solution I would prefer, but I have difficulties doing so because I don't know how to catch the warning without a try/except block.

So I guess it is partly a question whether there is a special HTSlib/pysam way to suppress a certain warning/error, and partly a more general python question if there is a way to handle the warning without interrupting the execution of the try block.

jmarshall commented 1 year ago

Okay, that clarifies that you are in the same situation as was previously encountered. Certainly thousands of spurious messages is undesirable!

I will apply the full fix described in #939 for the next pysam release, which will be fairly soon. In the meantime…

Unfortunately this message is a printf inside the C code so can’t be caught or manipulated at the Python level. The only thing you can do from Python is set the C verbosity to zero beforehand and reset it afterwards. Rest assured that any real errors will be transmitted to Python as actual errors or exceptions, so you won’t loose much by changing the verbosity.

sroener commented 1 year ago

Thank you for the clarification! I will use the verbosity fix for the time being and remove the part when the new pysam version is out.

jmarshall commented 1 year ago

The fix has now been applied in 95344cbce44c72b9e3bb877d80d2a5c657773351 and will soon appear in a release. Thanks for the timely reminder!