Move coverage.tgz and alignment.tar to be part of the per genome predictions

qiita-spots / qiita

Qiita - A multi-omics databasing effort

http://qiita.microbio.me

BSD 3-Clause "New" or "Revised" License

Move coverage.tgz and alignment.tar to be part of the per genome predictions #3414

Closed wasade closed 2 weeks ago

wasade commented 3 months ago

Currently, the coverage and alignment data are associated with the "Alignment profile" artifact, which contains a "none.biom" table. The features of the "none.biom" table are lineage strings, but the identifiers in the coverage and alignment data are genome IDs. The feature table relevant to the coverage and alignment data is stored in a separate artifact, "Per genome Predictions".

It would be helpful if the coverage and alignment data were co-located with the corresponding feature table.

antgonza commented 3 months ago

I actually like this; should we just then remove the "Alignment profile" artifact? @qiyunzhu, what do you think?

wasade commented 3 months ago

Sorry, I slightly misspoke: coverages is also in per genome predictions with the same checksum, so it may be replicated information.
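For anyone wanting to confirm that kind of duplication themselves, a minimal sketch of a checksum comparison follows. The helper names (`file_checksum`, `are_duplicates`) and the choice of SHA-256 are assumptions for illustration only; the thread does not say which checksum Qiita records, and this is not Qiita's own code.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large archives are never fully loaded."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops as soon as read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def are_duplicates(path_a, path_b):
    """Return True when both files have identical size and checksum."""
    # Cheap size check first: files of different size cannot be identical.
    if Path(path_a).stat().st_size != Path(path_b).stat().st_size:
        return False
    return file_checksum(path_a) == file_checksum(path_b)
```

Calling `are_duplicates` on the coverage archive in each of the two artifact directories (paths depend on the deployment) would report whether the copies are byte-identical.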

antgonza commented 3 months ago

That's correct for coverages: they are copied to all artifacts. However, I think we can move the alignment.tar file to the per-genome table, if that's agreeable.

wasade commented 3 months ago

:+1: on mv'ing alignment data but why retain a duplicate of the coverage data?

antgonza commented 3 months ago

Well, the plan is that one day we will be able to merge tables for meta-analysis and, at the same time, merge the coverage data on the fly. To allow this, the easiest approach is to have the coverage data living alongside the main biom for all tables. The good thing is that this file is small (compared to everything else we are storing).

wasade commented 3 months ago

Why not just use a symlink? The files look like they're ~1GB too?
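To make the symlink suggestion concrete, here is a minimal sketch of swapping a duplicate copy for a link to a single canonical file. The function name `replace_with_symlink` and the temporary-link suffix are hypothetical, and whether Qiita's artifact storage tolerates symlinks is not established in this thread; treat it as an illustration of the idea, not a proposed patch.

```python
import os
from pathlib import Path


def replace_with_symlink(duplicate, canonical):
    """Replace `duplicate` with a symlink pointing at `canonical`."""
    duplicate = Path(duplicate)
    canonical = Path(canonical).resolve()
    if not canonical.is_file():
        raise FileNotFoundError(canonical)
    # Create the link under a temporary name, then rename it over the
    # duplicate so the path never disappears partway through the swap.
    tmp_link = duplicate.with_name(duplicate.name + ".tmp-link")
    os.symlink(canonical, tmp_link)
    os.replace(tmp_link, duplicate)
    return duplicate
```

After the swap, readers of the duplicate path see the same bytes, while only one copy occupies disk space. It would be prudent to confirm the two files are identical before replacing anything.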

qiyunzhu commented 3 months ago

@antgonza @wasade I thought that the features of the "none.biom" table are genome IDs, not lineage strings. Aren't they? This is the single most important output of Woltka (i.e., the OGU table). I am totally fine with removing Alignment Profile, because its name and its content do not seem to match according to your description. I think that "alignment.tar" can be separate from any of the BIOM tables, because it is the output of Bowtie2 and the input of Woltka, which leads to all tables. Logically, it should be separate.

antgonza commented 3 weeks ago

FWIW, this is being addressed here: https://github.com/qiita-spots/qp-woltka/pull/30