
kamimrcht / REINDEER

REINDEER: REad Index for abuNDancE quERy

GNU Affero General Public License v3.0


output format #11

Open chklopp opened 3 years ago

chklopp commented 3 years ago

Regarding the output format, the documentation shows an example with

SRR10092187.6 66 2:4

but I get

m64077_200131_143655/2/ccs 0-4:15,5-11:16,12-17:13,18-65:17,66-81:11,82-83:16,84-135:13,136-141:15,142-157:13,158-159:17,160-170:3,171-175:4,176-181:16,182-185:27,186-188:5,189-204:12,205-209:6,210-270:13,271-289:16,290-294...

when I try to run it on long reads with version 7b957ae. Could you give us more insight into the format?

rchikhi commented 3 years ago

Hi Christophe,

Indeed, the README will need to be updated. The new output format is:

[coord1]-[coord2]:[abundance of monotig between coord1 and coord2 on query],[coord3]-[coord4]:[abundance of monotig between coord3 and coord4 on query],etc.

So essentially it's a more fine-grained view of where the abundances are on the query. If you only need a mean abundance across the whole query, just compute an average weighted by the differences of coordinates, i.e. (ab1*(coord2-coord1) + ab2*(coord4-coord3) + ...) / length of query.
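As a rough illustration of that weighted average, here is a minimal Python sketch. The function names `parse_intervals` and `weighted_mean_abundance` and the `query_length` value are hypothetical and not part of REINDEER. The weights follow the formula exactly as written (coord2 - coord1); the thread does not say whether coordinates are inclusive, and abundances are assumed to be integers as in the example output.

```python
def parse_intervals(field):
    """Parse 'c1-c2:ab,c3-c4:ab,...' into a list of (start, end, abundance)."""
    intervals = []
    for chunk in field.split(","):
        coords, abundance = chunk.split(":")
        start, end = coords.split("-")
        # Assumption: coordinates and abundances are plain integers.
        intervals.append((int(start), int(end), int(abundance)))
    return intervals


def weighted_mean_abundance(intervals, query_length):
    """Average of abundances weighted by coordinate differences."""
    total = sum(ab * (end - start) for start, end, ab in intervals)
    return total / query_length


# First three intervals of the output quoted earlier in the thread.
field = "0-4:15,5-11:16,12-17:13"
intervals = parse_intervals(field)
# query_length=18 is a made-up value for illustration only.
print(weighted_mean_abundance(intervals, query_length=18))
```

If the coordinates turn out to be inclusive, weighting by (end - start + 1) may be more appropriate; that is a guess, not something confirmed here.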

chklopp commented 3 years ago

Thank you Rayan,

When I try to use a BCALM database built with 47-mers (input FASTA file) using the -kmer-size 47 parameter, I only get one value for each of my reads.

Head of the output file:

m64077_200131_143655/2/ccs *
m64077_200131_143655/13/ccs 0-56:5
m64077_200131_143655/15/ccs *
m64077_200131_143655/26/ccs 0-51:5
m64077_200131_143655/28/ccs 0-63:5
m64077_200131_143655/29/ccs *
m64077_200131_143655/34/ccs 0-77:5
m64077_200131_143655/37/ccs *
m64077_200131_143655/41/ccs *
m64077_200131_143655/44/ccs 0-52:5

I was expecting to have the same abundance results along the read. Why is it different?

How do I relate monotigs to k-mers? I'm looking for a tool to count the number of k-mers (taken from a k-mer list) present in each read.

rchikhi commented 3 years ago

I answered privately to use Jellyfish for such a task. Here, REINDEER approximates the counts along your reads using monotigs instead of k-mers, but it really doesn't lose much information. E.g. for your first read no k-mer is found; for the second one, the k-mers covering positions 0 to 56 are part of a monotig which has an average coverage of 5, etc.
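To make the task from the question concrete (counting how many k-mers from a given list occur in each read), here is a small pure-Python sketch. It is neither Jellyfish nor REINDEER, and it would be far too slow for real long-read data sets. The names `revcomp`, `canonical`, and `count_listed_kmers` and the toy reads are hypothetical. It assumes k-mers should be matched on either strand (canonical form) and that sequences contain only uppercase A, C, G, T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, revcomp(kmer))


def count_listed_kmers(read, kmer_set, k):
    """Count read positions whose canonical k-mer is in kmer_set."""
    hits = 0
    for i in range(len(read) - k + 1):
        if canonical(read[i:i + k]) in kmer_set:
            hits += 1
    return hits


# Toy data for illustration only (k=3 rather than 47).
k = 3
kmer_set = {canonical(km) for km in ["ACG", "GGA"]}
reads = {"read1": "ACGTACG", "read2": "TTTTT"}
for name, seq in reads.items():
    print(name, count_listed_kmers(seq, kmer_set, k))
```

This counts occurrences, so a listed k-mer that appears twice in a read contributes two hits; collecting the matched k-mers in a set instead would give the number of distinct listed k-mers per read.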