Closed marcelm closed 6 years ago
This is an interesting idea. I'm assuming you're parallelizing by running multiple instances of your script, and not using Python's threading
or multiprocessing
libraries? If the latter, you shouldn't have to do anything as I've tried to make index creation thread-safe by locking around I/O code:
Otherwise I could see adding an argument such as Fasta(build_fai=True)
so that you can override the default behavior if you're going to enforce the index creation ahead of time. This could just propagate the IOError
if you need to catch this.
However I'd like to propose an even simpler solution. Are you currently generating the .fai
index ahead of time? If so then the Fasta
constructor should find and use the existing index file. You say:
We could then require that the index file exists before the program is run.
Maybe that's a solution that doesn't require any new functionality?
For creating the index file ahead of time, if you're calling a series of scripts, you can just run the faidx
utility that's bundled in pyfaidx
as a preprocessing step in your pipeline. If you want to suppress the default output of faidx
you can invert the default match (all entries in the file) like so: faidx -v file.fa
and this will just index the file if needed.
Thanks for the quick reply. Yes, I had actually looked at the sources already to find out whether there was a parameter that did what I wanted and had noticed the Lock
around index creation. But, as you guessed correctly, it doesn’t help me because it’s multiple processes in our case that are running in parallel.
However I'd like to propose an even simpler solution. Are you currently generating the .fai index ahead of time? If so then the Fasta constructor should find and use the existing index file.
When loading the genome reference in WhatsHap, I simply want to require that the .fai
file exists, thus forcing the user to have created it before they run our program. On the other hand, I also want to guarantee that we never write a .fai
file ourselves. (The user may not even have permissions to create a file in the directory where the FASTA file is located.)
I have now implemented it such that WhatsHap aborts if it cannot find a .fai
file – before it then goes on to instantiate the Fasta()
file. This of course works most of the time, but dealing with index files (such as where they are located and how they are named) should be pyfaidx
s concern. It would be much nicer if I could just switch off index creation.
Note one problem with this approach: After checking for the file’s existence, the file could be deleted, leading to pyfaidx
creating it again. Not likely to happen, but these race conditions lead to hard to find bugs.
I admit I’m a bit biased by now because of the problems we had, but I actually found it surprising that pyfaidx
creates an index automatically at all, without explicitly being asked to do so. It’s of course too late now to change this, but from an application developer’s perspective, “explicit is better than implicit” and opt-in would perhaps have been a better choice.
We do have a working solution at the moment, so feel free to proceed as you wish!
This could just propagate the IOError if you need to catch this.
In case you are going for adding a new parameter: It would be nice if there were a separate exception (IndexNotFoundError
or whatever) so the cases “FASTA file not found” and “the index is missing” could be distinguished. That would make it easier to present sensible error messages to our users :-)
Thanks for the clarification.
I actually found it surprising that pyfaidx creates an index automatically at all
Yeah, in the early days I was concerned about this, but rationalized the behavior because "samtools does it this way".
It sounds like I can implement two minor features to improve the Fasta
object interface:
build_fai
parameter to the Fasta
constructor, defaulting to True
IndexNotFoundError
and FastaNotFoundError
I'll add these two in a feature branch and then release ASAP
Yes, these changes would be exactly what we need!
Yeah, in the early days I was concerned about this, but rationalized the behavior because "samtools does it this way".
If it’s any comfort: We initially built WhatsHap to automatically create an index for its input BAM files if there was no .bai
. I had to remove that code today because of the same parallelization issues.
Thanks for the quick fix! If you do release within the days, then we could already depend on this new version since we’re planning a release as well. (But no rush, the existing code works also.)
Thanks!
I've just released a tagged version to CI, so if test pass it should be on PyPI in a few minutes.
I’m glad I reported this instead of just leaving our work-around in place. Thanks again!
I took the liberty of updating the recipe in bioconda: https://github.com/bioconda/bioconda-recipes/pull/7527
Thanks! I wish there was some automated way of pushing new releases into Bioconda.
We’re using pyfaidx in WhatsHap, a software for read-based phasing. When running multiple instances of the program in parallel that all access the same indexed FASTA (happens when working with multiple samples or when we parallelize by chromosome), we have the problem that the generated
.fai
files clobber each other.To avoid the problem, I’d like to prevent pyfaidx from creating the index file. We could then require that the index file exists before the program is run. Would it be possible to add an option such as
fail_if_index_not_found
to theFasta()
constructor?