replikation / poreCov

SARS-CoV-2 workflow for nanopore sequence data
https://case-group.github.io/
GNU General Public License v3.0

collect consensus sequences in one folder #69

Closed oliverdrechsel closed 3 years ago

oliverdrechsel commented 3 years ago

Hi,

would it be possible to put or link the final consensus sequences into a single output folder, something like 'consensus_sequences'? This would make it easier to copy out all consensus sequences for further use outside of the pipeline.

hoelzer commented 3 years ago

I think having a multi-FASTA at 2.Genomes/all-consensus-sequences.fasta that combines all the single FASTAs in that folder into one file should do the job? Of course, this might still include reconstructed consensus sequences that fail a later QC, but this can be checked in the report.

oliverdrechsel commented 3 years ago

I personally would object to multi-FASTA files, as they assume that all sequencing data are delivered to the same target folder. They are much harder to split (recreating meaningful names) than single files are to fuse.
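To illustrate the asymmetry, here is a minimal Python sketch, not part of poreCov. The function names fuse_fastas and split_fasta are hypothetical, and the split assumes the first word of each header is a usable, unique sample name, which is exactly the part that is hard to guarantee.

```python
from pathlib import Path


def fuse_fastas(fasta_files, multi_fasta):
    """Concatenate single-record FASTA files into one multi-FASTA file."""
    with open(multi_fasta, "w") as out:
        for fasta in sorted(str(f) for f in fasta_files):
            text = Path(fasta).read_text()
            if not text:
                continue  # skip empty files
            # Ensure a trailing newline so the next header starts on its own line.
            out.write(text if text.endswith("\n") else text + "\n")


def split_fasta(multi_fasta, out_dir):
    """Split a multi-FASTA; file names have to be rebuilt from the headers."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = {}
    name = None
    for line in Path(multi_fasta).read_text().splitlines():
        if line.startswith(">"):
            # Assumption: the first header word is a usable sample name.
            # Duplicate names would overwrite each other in this sketch.
            words = line[1:].split()
            name = words[0] if words else f"record_{len(records) + 1}"
            name = name.replace("/", "_")
            records[name] = [line]
        elif name is not None:
            records[name].append(line)
    for name, lines in records.items():
        (out_dir / f"{name}.fasta").write_text("\n".join(lines) + "\n")
    return sorted(records)
```

Fusing is a plain concatenation, while splitting has to guess file names from header content.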

replikation commented 3 years ago

hi @oliverdrechsel

each sample's consensus FASTA file is located in this folder:

./<outputdirectory>/2.Genomes/<sample_name>/<samplename>_consensus.fasta

<outputdirectory> can be changed via the --output flag. <samplename> is usually "barcode01" etc. if you start from basecalling.

Multi-FASTA files (containing only sequences that pass QC) are only collected via the optional --rki flag. Maybe I misunderstood the question?

hoelzer commented 3 years ago

Ah, okay, now I get what you mean, @oliverdrechsel. You just want to have all the single FASTAs (one per sample) in one single output folder, right? Instead of sub-folders as described by @replikation:

./<outputdirectory>/2.Genomes/<sample_name>/<samplename>_consensus.fasta

so something like:

./<outputdirectory>/2.Genomes/all/<samplename>_consensus.fasta

?

It's a minor thing but maybe we can simply publish the FASTAs also to

./<outputdirectory>/2.Genomes/all_consensus/<samplename>_consensus.fasta

Thus, we would have the per-sample folder structure to check for details (VCF, BAM, FASTA, PDF, ...) and another folder that just has all the FASTAs.

or do you want an additional output folder for all the consensuses that is outside of the

./<outputdirectory>/

structure? This would need an additional parameter e.g.

--output_consensus /some/other/path/to/write/all/consensus/fasta

oliverdrechsel commented 3 years ago

Hi @hoelzer, something like ./<outputdirectory>/2.Genomes/all/<samplename>_consensus.fasta would be fine, I think. It would be easier to distribute the data somewhere else if one only has to visit one folder instead of iterating through various folders to get all output data.

hoelzer commented 3 years ago

Okay, I think this should be doable with e.g. an optional --collect <outputdirectory> flag
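
Independent of any pipeline flag, the flat folder discussed above can be approximated with a small script run after the workflow finishes. This is a minimal Python sketch, not poreCov code: it assumes the layout ./<outputdirectory>/2.Genomes/<sample_name>/<samplename>_consensus.fasta described earlier in the thread, and the function name collect_consensus, its parameters, and the default target 2.Genomes/all/ are hypothetical.

```python
import shutil
from pathlib import Path


def collect_consensus(output_dir, target_dir=None, link=False):
    """Gather every per-sample *_consensus.fasta into one flat folder."""
    genomes = Path(output_dir) / "2.Genomes"
    # Hypothetical default target, following the 2.Genomes/all/ idea above.
    target = Path(target_dir) if target_dir else genomes / "all"
    target.mkdir(parents=True, exist_ok=True)
    collected = []
    for fasta in sorted(genomes.glob("*/*_consensus.fasta")):
        if fasta.parent.resolve() == target.resolve():
            continue  # on re-runs, skip files already in the target folder
        dest = target / fasta.name
        # Remove a stale copy or link so we never write through a symlink.
        if dest.is_symlink() or dest.exists():
            dest.unlink()
        if link:
            dest.symlink_to(fasta.resolve())
        else:
            shutil.copy2(fasta, dest)
        collected.append(dest)
    return collected
```

For example, collect_consensus("results", link=True) would symlink the files instead of copying them, which keeps the per-sample folders as the single source while still offering one folder to visit. Note that this collects every consensus, including ones that fail a later QC.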