mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx
Other
459 stars 75 forks source link

Allow the use of a key_function in Faidx.build_index #111

Closed terrycojones closed 7 years ago

terrycojones commented 7 years ago

It's currently possible to pass a key_function to Fasta that gets passed along to the Faidx instance. But the Faidx instance only uses key_function in reading an index. It would be great if you could also use it in writing. It's possible that having a default key_function of lambda rname: rname.split()[0] might not change anything (I mean all tests might still pass). That default would do what build_index already does, but would do it via rname = key_function(line.rstrip('\n\r')[1:]). I'm not sure if there would be any negative impact on changing the default key function for read - it would only make a difference if a key had a space.

The benefit of this suggestion is that people could read and write index files that had the full FASTA id & description in them. The cli faidz too could write such an index when passed -f (instead of just printing the full names but only putting the id into the index).

Make sense?

mdshw5 commented 7 years ago

I think this change makes sense. I'd be up for making the change described here, and considering the possibility of implementing BGZF support around your comment on #77 would probably release both changes as v4.9.0.

Just so I understand your use case: you want to read or build .fai index files using the same key function, which does make sense. You'd be okay with setting the default key_function default as lambda rname: rname.split()[0], which would keep the current build behavior as dropping everything after the first whitespace character. We change the Fasta and Faidx key_function default from lambda rname: rname.

terrycojones commented 7 years ago

Hi Matt - thanks for the reply.

Yes, that's right on the key_function description.

I was thinking of sending you a pull request for #77. I've not done the work, but I stared at the code today and think I know roughly what to do. If you're happy for me to have a go, I'll try it tonight. The hardest part might be writing the tests :-)

mdshw5 commented 7 years ago

I'd love to receive a PR for #77 or this feature :). Knock yourself out!

mdshw5 commented 7 years ago

Hey @terrycojones. After refactoring a bit I've decided to add an argument to Faidx (and Fasta) called read_long_names, which defaults to false for samtools compatibility, but if set to true will read the full defline from the fasta file when reading the fasta index, and will use these full names in place of the names in the index file. I think this supports your use case, since you can then manipulate these full names using split_char, key_function, and filter_function. This was just easier and cleaner from an API perspective (using key_function for both index building and reading was problematic: think of key_function=lambda x: x.split('.')[2]), and has the benefit of not generating strange .fai files for other tools to read in and choke on. I'll cut a v0.4.9 release - please let me know if you have any needs this version doesn't address!

terrycojones commented 7 years ago

Hi Matt. Sorry for the slowness. Yes, this sounds good. I'll go find the other ticket you commented on and reply to that too.