Questions about using the seqcluster json output for some pipeline analysis

BioinfoHub-PeiQinNg commented 4 years ago

Hi there! I am currently using seqcluster to do some pipeline analysis on small RNAs, which requires me to extract some information that is embedded in the json output for seqcluster.

These are the two questions I have regarding the json file:

1) I would like to confirm that the order of the entries in the json file is as follows: metacluster -> loci information -> cluster ID -> chromosome -> start -> end -> strand -> number of sequences within the cluster?

2) What factors determine which small RNA sequences are reported in the json file, i.e. the ones with the most clusters?

Thank you. It would be great if you can help clarify these issues.
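For anyone reading along, the field order listed in question 1 can be unpacked roughly as follows. This is only a sketch: the surrounding layout (a dict keyed by metacluster ID, each holding a "loci" list of six-element records) and the names `Locus` and `iter_loci` are assumptions made for illustration, not something taken from the seqcluster documentation, so check them against your own file first.

```python
import json
from typing import Iterator, NamedTuple, Tuple


class Locus(NamedTuple):
    """One locus record, in the field order listed in question 1."""
    locus_id: str
    chrom: str
    start: int
    end: int
    strand: str
    n_seqs: int


def iter_loci(metaclusters: dict) -> Iterator[Tuple[str, Locus]]:
    """Yield (metacluster_id, Locus) pairs from an already-parsed JSON object.

    ASSUMED layout (hypothetical, verify against your own output):
    {metacluster_id: {"loci": [[id, chrom, start, end, strand, n_seqs], ...]}}
    """
    for meta_id, entry in metaclusters.items():
        for fields in entry.get("loci", []):
            locus_id, chrom, start, end, strand, n_seqs = fields
            yield meta_id, Locus(
                str(locus_id), str(chrom), int(start), int(end), str(strand), int(n_seqs)
            )


# Tiny made-up example, not real seqcluster output.
example = json.loads('{"1": {"loci": [["5", "chr1", 100, 150, "+", 12]]}}')
loci = list(iter_loci(example))
```

If the real file nests these records differently, only the two loops in `iter_loci` should need to change.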

lpantano commented 4 years ago

Hi,

Thank you for the questions.

1- That seems correct, but cluster ID is really locus ID.

2- The sequences that are removed are the ones that map too many times to the genome, i.e. when sequence_counts/total_hits_genome < 0.1. Otherwise the sequences should be there if they are assigned to a meta-cluster. If a locus has fewer than 10 sequences it won't be kept, so sequences that group into fewer than that number are lost as well.

Cheers
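The two rules described in this reply can be written down as a small sketch. It is not seqcluster's actual code: the function and constant names are made up for illustration, and treating a sequence with zero genome hits as removed is an assumption. Whether the threshold of 10 counts distinct sequences or reads is also not spelled out here, so the sketch just takes a number.

```python
# Thresholds as stated in the reply above; the names are hypothetical.
MIN_HIT_RATIO = 0.1
MIN_SEQS_PER_LOCUS = 10


def keep_sequence(sequence_counts: int, total_hits_genome: int) -> bool:
    """Return False when sequence_counts / total_hits_genome < 0.1."""
    if total_hits_genome <= 0:
        # Assumption: a sequence with no genome hits is not reported.
        return False
    return sequence_counts / total_hits_genome >= MIN_HIT_RATIO


def keep_locus(n_sequences: int) -> bool:
    """Return False for a locus with fewer than 10 sequences."""
    return n_sequences >= MIN_SEQS_PER_LOCUS
```

For example, a sequence with 5 counts and 100 genome hits has a ratio of 0.05 and would be removed, while one with 50 counts and 100 hits (ratio 0.5) would be kept if it is assigned to a meta-cluster.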

BioinfoHub-PeiQinNg commented 4 years ago

Hi,

Thank you for answering my questions. Just to clarify for 2- If I understood correctly, a locus that has fewer than 10 reads mapped will not be reported, right?

Also, I have noticed that with the annotation file I supplied, some entries in the counts.tsv file come back with a pipe symbol "|" under the ann column. Would it be fair to say that these are intergenic regions?

Besides the json files, is there a more effective way of retrieving the locus regions? I am not sure if I have missed a quicker way of extracting the cluster region (i.e. locus region) while looking through the seqcluster output.

Thank you so much for clarifying these issues I have while working with seqcluster.

lpantano commented 4 years ago

On October 17, 2019 at 12:18:28 AM, BioinfoHub-PeiQinNg wrote:

> Hi,
>
> Thank you for answering my questions. Just to clarify for 2- If I understood correctly, a locus that has fewer than 10 reads mapped will not be reported, right?

Yes

> Also, I have noticed that with the annotation file I supplied, some entries in the counts.tsv file come back with a pipe symbol "|" under the ann column. Would it be fair to say that these are intergenic regions?

Yes

> Besides the json files, is there a more effective way of retrieving the locus regions? I am not sure if I have missed a quicker way of extracting the cluster region (i.e. locus region) while looking through the seqcluster output.

There is a positions.bed file that contains only the cluster/loci positions, if you are interested in that.

> Thank you so much for clarifying these issues I have while working with seqcluster.

:)
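A minimal sketch for reading locus regions out of positions.bed, assuming it follows the usual BED convention (tab-separated, with chromosome, start and end in the first three columns). The thread does not describe the remaining columns, so they are kept as raw strings; `BedRecord` and `read_bed` are illustrative names, not part of seqcluster.

```python
import io
from typing import Iterator, NamedTuple, TextIO, Tuple


class BedRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    extra: Tuple[str, ...]  # any further columns, left uninterpreted


def read_bed(handle: TextIO) -> Iterator[BedRecord]:
    """Yield one record per data line of a tab-separated BED file."""
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and the optional header lines some BED files carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a valid BED line
        yield BedRecord(fields[0], int(fields[1]), int(fields[2]), tuple(fields[3:]))


# Made-up line for illustration, not real seqcluster output.
sample = io.StringIO("chr1\t100\t150\tlocus_5\t0\t+\n")
records = list(read_bed(sample))
```

With a real run, the in-memory sample would be replaced by an open file handle on positions.bed in your seqcluster output directory.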


lpantano / seqcluster

Questions about using the seqcluster json output for some pipeline analysis #49