Option for failing if index not found

marcelm commented 6 years ago

We’re using pyfaidx in WhatsHap, a software for read-based phasing. When running multiple instances of the program in parallel that all access the same indexed FASTA (happens when working with multiple samples or when we parallelize by chromosome), we have the problem that the generated .fai files clobber each other.

To avoid the problem, I’d like to prevent pyfaidx from creating the index file. We could then require that the index file exists before the program is run. Would it be possible to add an option such as fail_if_index_not_found to the Fasta() constructor?

mdshw5 commented 6 years ago

This is an interesting idea. I'm assuming you're parallelizing by running multiple instances of your script, and not using Python's threading or multiprocessing libraries? If the latter, you shouldn't have to do anything as I've tried to make index creation thread-safe by locking around I/O code:

https://github.com/mdshw5/pyfaidx/blob/442a84dbe89b7e1158c2ef7af94f44846f8c87c4/pyfaidx/__init__.py#L362-L379

Otherwise I could see adding an argument such as Fasta(build_fai=True) so that you can override the default behavior if you're going to enforce the index creation ahead of time. This could just propagate the IOError if you need to catch this.

However I'd like to propose an even simpler solution. Are you currently generating the .fai index ahead of time? If so then the Fasta constructor should find and use the existing index file. You say:

We could then require that the index file exists before the program is run.

Maybe that's a solution that doesn't require any new functionality?

mdshw5 commented 6 years ago

For creating the index file ahead of time, if you're calling a series of scripts, you can just run the faidx utility that's bundled in pyfaidx as a preprocessing step in your pipeline. If you want to suppress the default output of faidx you can invert the default match (all entries in the file) like so: faidx -v file.fa and this will just index the file if needed.

marcelm commented 6 years ago

Thanks for the quick reply. Yes, I had actually looked at the sources already to find out whether there was a parameter that did what I wanted and had noticed the Lock around index creation. But, as you guessed correctly, it doesn’t help me because it’s multiple processes in our case that are running in parallel.

However I'd like to propose an even simpler solution. Are you currently generating the .fai index ahead of time? If so then the Fasta constructor should find and use the existing index file.

When loading the genome reference in WhatsHap, I simply want to require that the .fai file exists, thus forcing the user to have created it before they run our program. On the other hand, I also want to guarantee that we never write a .fai file ourselves. (The user may not even have permissions to create a file in the directory where the FASTA file is located.)

I have now implemented it such that WhatsHap aborts if it cannot find a .fai file – before it then goes on to instantiate the Fasta() file. This of course works most of the time, but dealing with index files (such as where they are located and how they are named) should be pyfaidxs concern. It would be much nicer if I could just switch off index creation.

Note one problem with this approach: After checking for the file’s existence, the file could be deleted, leading to pyfaidx creating it again. Not likely to happen, but these race conditions lead to hard to find bugs.

I admit I’m a bit biased by now because of the problems we had, but I actually found it surprising that pyfaidx creates an index automatically at all, without explicitly being asked to do so. It’s of course too late now to change this, but from an application developer’s perspective, “explicit is better than implicit” and opt-in would perhaps have been a better choice.

We do have a working solution at the moment, so feel free to proceed as you wish!

marcelm commented 6 years ago

This could just propagate the IOError if you need to catch this.

In case you are going for adding a new parameter: It would be nice if there were a separate exception (IndexNotFoundError or whatever) so the cases “FASTA file not found” and “the index is missing” could be distinguished. That would make it easier to present sensible error messages to our users :-)

mdshw5 commented 6 years ago

Thanks for the clarification.

I actually found it surprising that pyfaidx creates an index automatically at all

Yeah, in the early days I was concerned about this, but rationalized the behavior because "samtools does it this way".

It sounds like I can implement two minor features to improve the Fasta object interface:

Add build_fai parameter to the Fasta constructor, defaulting to True
Add exceptions for IndexNotFoundError and FastaNotFoundError

I'll add these two in a feature branch and then release ASAP

marcelm commented 6 years ago

Yes, these changes would be exactly what we need!

Yeah, in the early days I was concerned about this, but rationalized the behavior because "samtools does it this way".

If it’s any comfort: We initially built WhatsHap to automatically create an index for its input BAM files if there was no .bai. I had to remove that code today because of the same parallelization issues.

Thanks for the quick fix! If you do release within the days, then we could already depend on this new version since we’re planning a release as well. (But no rush, the existing code works also.)

Thanks!

mdshw5 commented 6 years ago

I've just released a tagged version to CI, so if test pass it should be on PyPI in a few minutes.

marcelm commented 6 years ago

I’m glad I reported this instead of just leaving our work-around in place. Thanks again!

marcelm commented 6 years ago

I took the liberty of updating the recipe in bioconda: https://github.com/bioconda/bioconda-recipes/pull/7527

mdshw5 commented 6 years ago

Thanks! I wish there was some automated way of pushing new releases into Bioconda.

mdshw5 / pyfaidx

Option for failing if index not found #134