Closed VinzentRisch closed 1 month ago
Hi @misialq
In sample_dict of MultiFASTADirectoryFormat
you do sample_id = d.name.rsplit('/', 1)[0]
The d.name
seems to be enough why did you add the rsplit? What case does it cover?
Also in the same action there are these checks if the files match the pathspec:
if not mags_pattern.match(path.name):
continue
So it checks if the files end in fasta or fa. But isn't this redundant because in the validation of the format only files with those extensions are allowed anyway? Or am I missing something?
I am asking because I copied that sample_dict function and modified it.
Hey @VinzentRisch,
So for the d.name
question I'm not entirely sure as, tbh, I copied this method from somewhere already 😅 When I look at it now I must agree with you: I cannot think of a use case for that split. That code looks very suggestive of the name attribute containing a full path rather than just the last piece of it, in which case we'd need to do a rsplit. If that's not the case, then I guess you can just use the name directly.
Re: matching the pathspec: that is also a good question - technically, it should not be required since, as you wrote, only those files are allowed. I guess that was more of a failsafe but you could probably skip it.
Hi @misialq I created a new format called GenomeDataDirectoryFormat that the four dir formats can inherit from.
closes #344
For easier handling of
GenesDirectoryFormat
,ProteinsDirectoryFormat
,LociDirectoryFormat
andDNASequencesDirectoryFormat
in q2-amrfinderplus I created a new DirFormat calledGenomeDataDirectoryFormat
that they all inherit from. This new dirformat has a function called genome_dict.MultiMAGSequencesDirFmt
or feature_dict ofMAGSequencesDirFmt
.GenesDirectoryFormat
andProteinsDirectoryFormat
can contain files in per sample directories or not. So depending on the directory structure the output of genome_dict will be similar to sample_dict ofMultiMAGSequencesDirFmt
or feature_dict ofMAGSequencesDirFmt
.