Closed terrycojones closed 7 years ago
I think this change makes sense. I'd be up for making the change described here, and considering the possibility of implementing BGZF support around your comment on #77 would probably release both changes as v4.9.0.
Just so I understand your use case: you want to read or build .fai index files using the same key function, which does make sense. You'd be okay with setting the default key_function
default as lambda rname: rname.split()[0]
, which would keep the current build behavior as dropping everything after the first whitespace character. We change the Fasta
and Faidx
key_function
default from lambda rname: rname
.
Hi Matt - thanks for the reply.
Yes, that's right on the key_function
description.
I was thinking of sending you a pull request for #77. I've not done the work, but I stared at the code today and think I know roughly what to do. If you're happy for me to have a go, I'll try it tonight. The hardest part might be writing the tests :-)
I'd love to receive a PR for #77 or this feature :). Knock yourself out!
Hey @terrycojones. After refactoring a bit I've decided to add an argument to Faidx
(and Fasta
) called read_long_names
, which defaults to false for samtools compatibility, but if set to true will read the full defline from the fasta file when reading the fasta index, and will use these full names in place of the names in the index file. I think this supports your use case, since you can then manipulate these full names using split_char
, key_function
, and filter_function
. This was just easier and cleaner from an API perspective (using key_function
for both index building and reading was problematic: think of key_function=lambda x: x.split('.')[2]
), and has the benefit of not generating strange .fai files for other tools to read in and choke on. I'll cut a v0.4.9 release - please let me know if you have any needs this version doesn't address!
Hi Matt. Sorry for the slowness. Yes, this sounds good. I'll go find the other ticket you commented on and reply to that too.
It's currently possible to pass a
key_function
toFasta
that gets passed along to theFaidx
instance. But theFaidx
instance only useskey_function
in reading an index. It would be great if you could also use it in writing. It's possible that having a defaultkey_function
oflambda rname: rname.split()[0]
might not change anything (I mean all tests might still pass). That default would do whatbuild_index
already does, but would do it viarname = key_function(line.rstrip('\n\r')[1:])
. I'm not sure if there would be any negative impact on changing the default key function for read - it would only make a difference if a key had a space.The benefit of this suggestion is that people could read and write index files that had the full FASTA id & description in them. The cli faidz too could write such an index when passed
-f
(instead of just printing the full names but only putting the id into the index).Make sense?