qiime2 / q2-types

BSD 3-Clause "New" or "Revised" License
17 stars 41 forks source link

ENH: Adds new format `GenomeDataDirectoryFormat` with `genome_dict` function #345

Closed VinzentRisch closed 1 month ago

VinzentRisch commented 1 month ago

closes #344

For easier handling of GenesDirectoryFormat, ProteinsDirectoryFormat, LociDirectoryFormat and DNASequencesDirectoryFormat in q2-amrfinderplus I created a new DirFormat called GenomeDataDirectoryFormat that they all inherit from. This new dirformat has a function called genome_dict.

VinzentRisch commented 1 month ago

Hi @misialq In sample_dict of MultiFASTADirectoryFormat you do sample_id = d.name.rsplit('/', 1)[0] The d.name seems to be enough why did you add the rsplit? What case does it cover?

Also in the same action there are these checks if the files match the pathspec:

                if not mags_pattern.match(path.name):
                    continue

So it checks if the files end in fasta or fa. But isn't this redundant because in the validation of the format only files with those extensions are allowed anyway? Or am I missing something?

I am asking because I copied that sample_dict function and modified it.

misialq commented 1 month ago

Hey @VinzentRisch,

So for the d.name question I'm not entirely sure as, tbh, I copied this method from somewhere already 😅 When I look at it now I must agree with you: I cannot think of a use case for that split. That code looks very suggestive of the name attribute containing a full path rather than just the last piece of it, in which case we'd need to do a rsplit. If that's not the case, then I guess you can just use the name directly.

Re: matching the pathspec: that is also a good question - technically, it should not be required since, as you wrote, only those files are allowed. I guess that was more of a failsafe but you could probably skip it.

VinzentRisch commented 1 month ago

Hi @misialq I created a new format called GenomeDataDirectoryFormat that the four dir formats can inherit from.