s4hts / HTStream

A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
https://s4hts.github.io/HTStream/
Apache License 2.0

Stats file format issue #221

Closed joe-angell closed 4 years ago

joe-angell commented 4 years ago

Is your enhancement request related to a problem? Please describe. The top level of the stats file is a JSON object, which should have unique keys (therefore you can't run the same program in a pipeline more than once). Also, key order is not retained in an object, so you lose the order in which the pipeline was run.
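To illustrate the duplicate-key problem, here is a minimal Python sketch (Python because MultiQC is written in it). The stats content below is made up for illustration and is not real HTStream output; other parsers may handle repeated keys differently.

```python
import json

# Hypothetical stats file in which hts_SeqScreener ran twice.
# Keys and values are illustrative, not real HTStream output.
text = '''
{
  "hts_SeqScreener": {"run": "first"},
  "hts_Stats": {"run": "only"},
  "hts_SeqScreener": {"run": "second"}
}
'''

stats = json.loads(text)

# Python's json module keeps only the last value for a repeated key,
# so the first hts_SeqScreener entry is silently lost.
print(len(stats))
print(stats["hts_SeqScreener"])
```

Three programs went in, but only two entries come out of the parser.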

Describe the solution you'd like. The top level should be changed to a list of objects: [ { "Program": "hts_Stats", "Program_details": {...} ... }, {"Program": "hts_SeqScreener", ...} ]
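A rough sketch of how a list-style file would behave when parsed, assuming only the "Program" key from the proposal; everything else about the entries is left out, and the sample data is invented for illustration.

```python
import json

# Hypothetical list-style stats file. The "Program" key follows the
# proposal in this issue; the sample entries are made up.
text = '''
[
  {"Program": "hts_Stats"},
  {"Program": "hts_SeqScreener"},
  {"Program": "hts_Stats"}
]
'''

steps = json.loads(text)

# A JSON array is ordered and may contain repeated entries, so both the
# pipeline order and the second hts_Stats run survive parsing.
programs = [step["Program"] for step in steps]
print(programs)
```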

joe-angell commented 4 years ago

@msettles recommend we do this before the integration with multiqc is done.

msettles commented 4 years ago

Yes, I agree this is an issue and we need to resolve it soon. However, a vector format would be super ugly, at least for the R implementation. I'd first want to test it and make sure it doesn't cause any issues. If it does, we may want to think of another alternative, or go back to adding in the pid. I'll test this weekend.

joe-angell commented 4 years ago

A JSON object is an unordered data structure, so we cannot know the order the programs ran in unless we use a list. We might be able to use a start timestamp, but I think the shell launches all the programs in a pipe at the same time, so that probably won't work. The order of object elements is implementation-specific; I'm surprised you haven't run into other issues with this.

joe-angell commented 4 years ago

chain.log example file, program name is in [0].program_details.program
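For readers on the Python side, a small sketch of pulling the program names out in pipeline order. It assumes the log is a top-level JSON array and that each entry has the program_details.program path mentioned above; the helper name `program_names` is hypothetical.

```python
import json
from pathlib import Path


def program_names(log_path):
    """Return the program names from a list-style log, in file order.

    Assumes each entry looks like {"program_details": {"program": ...}};
    that layout is taken from the path quoted in this thread.
    """
    steps = json.loads(Path(log_path).read_text())
    return [step["program_details"]["program"] for step in steps]
```

Calling `program_names("chain.log")` would then give one name per pipeline step, including repeats.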

joe-angell commented 4 years ago

we decided this format will work:

library(jsonlite)
# Read the log without collapsing the array of objects into a data frame
j <- fromJSON("chain.log", simplifyDataFrame = FALSE)

samhunter commented 4 years ago

Matt says: [image]