ksahlin / NGSpeciesID

Reference-free clustering and consensus forming of long-read amplicon sequencing
GNU General Public License v3.0
50 stars 16 forks source link

Clarification on NGSpID output #24

Open omarkr8 opened 1 year ago

omarkr8 commented 1 year ago

I'm looking at the output from the pipeline and referring to text on the main page :

"The final cluster information is given in a tsv file final_clusters.tsv present in the specified output folder. In the cluster TSV-file, the first column is the cluster ID and the second column is the read accession. if there are n reads there will be n rows. Some reads might be singletons. The rows are ordered with respect to the size of the cluster (largest first)."

please do correct me if my understanding is lacking here.. 1) so let's say my _finalclusters.tsv has 31k rows. so that's 31k reads from the fastq. the first column goes from 0-22, so there's 23 clusters. 0 and 1 has far more rows than the others. so more reads were used for those. the remaining rows are just sequencing run details.

2) supposedly each cluster is processed in its own medaka_clid# folder. but I only see 17 folders. why not 23, one for each cluster? this sample has 9 singleton clusters. another sample has 0-36 clusters, but 28 medaka folders, 12 of those clusters are singletons.

3) I assume that the first medaka folder in the directory corresponds to the first cluster (cluster 0). Is there a way to verify this? or perhaps it is practical to parse the read number from the consensus.fasta headers?

4) does anyone have a loop handy to cat all the consensus.fastas from a single sample, and loop through multiple samples?

I guess im needing more explanation on how these clusters are designated. but its a very cool tool for sure!

ksahlin commented 1 year ago
  1. Correct, but column 2 refer to the actual read id, see https://github.com/ksahlin/isONclust#output
  2. NGSpeciesID does several more steps in between clustering and final consensus; (i) use only clusters with more that a fraction of total reads, (ii) reverse complement removal, and (iii) barcode removal. Therefore, some clusters(/consensuses) may be further merged and removed based on low fraction. This is why you don't see all 23 consensus. (see Figure 1 in the paper for the whole pipeline)
  3. Yes, the medaka folder id corresponds to the cluster id.
  4. I think it succices to cat folder/medaka_cl_id*/consensus.fasta > all_consensi.fa but maybe your looking for something more advanced.
omarkr8 commented 1 year ago

Thanks. some follow-up 2) so between final_cluster.tsv, and actual medaka content, some clusters will be dropped, or merged. is the fate of the clusters recorded somewhere? I assume that high abundance clusters are unlikely to be dropped this way. 3) in terms of finding out the size of each cluster, is it correct to parse the headers? for example, ">consensus_cl_id_414_total_supporting_reads_31235_segment0" would mean this cluster has 31235 reads associated with it? 4) thanks, i ended up writing something similar.

ksahlin commented 1 year ago
  1. Not really, but you can see from the output 'how many passed the abundance threshold'. Also, there is a parameter to set this lower if you want more clusters through. Similarly, output will say how many reverse complements it detected and merged.
  2. yes.