ENH: Adds new format `GenomeDataDirectoryFormat` with `genome_dict` function

qiime2 / q2-types

BSD 3-Clause "New" or "Revised" License

17 stars 41 forks source link

ENH: Adds new format `GenomeDataDirectoryFormat` with `genome_dict` function #345

Closed VinzentRisch closed 1 month ago

VinzentRisch commented 1 month ago

closes #344

For easier handling of GenesDirectoryFormat, ProteinsDirectoryFormat, LociDirectoryFormat and DNASequencesDirectoryFormat in q2-amrfinderplus I created a new DirFormat called GenomeDataDirectoryFormat that they all inherit from. This new dirformat has a function called genome_dict.

It will return a dict of file names as keys and full paths to files as values.
These functions will work similarly to sample_dict of MultiMAGSequencesDirFmt or feature_dict of MAGSequencesDirFmt.
GenesDirectoryFormat and ProteinsDirectoryFormat can contain files in per sample directories or not. So depending on the directory structure the output of genome_dict will be similar to sample_dict of MultiMAGSequencesDirFmt or feature_dict of MAGSequencesDirFmt.

VinzentRisch commented 1 month ago

Hi @misialq In sample_dict of MultiFASTADirectoryFormat you do sample_id = d.name.rsplit('/', 1)[0] The d.name seems to be enough why did you add the rsplit? What case does it cover?

Also in the same action there are these checks if the files match the pathspec:

                if not mags_pattern.match(path.name):
                    continue

So it checks if the files end in fasta or fa. But isn't this redundant because in the validation of the format only files with those extensions are allowed anyway? Or am I missing something?

I am asking because I copied that sample_dict function and modified it.

misialq commented 1 month ago

Hey @VinzentRisch,

So for the d.name question I'm not entirely sure as, tbh, I copied this method from somewhere already 😅 When I look at it now I must agree with you: I cannot think of a use case for that split. That code looks very suggestive of the name attribute containing a full path rather than just the last piece of it, in which case we'd need to do a rsplit. If that's not the case, then I guess you can just use the name directly.

Re: matching the pathspec: that is also a good question - technically, it should not be required since, as you wrote, only those files are allowed. I guess that was more of a failsafe but you could probably skip it.

VinzentRisch commented 1 month ago

Hi @misialq I created a new format called GenomeDataDirectoryFormat that the four dir formats can inherit from.