kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

mapping_rate is wrong #58

Closed kfuku52 closed 3 years ago

kfuku52 commented 3 years ago

@Hego-CCTB numread* are unbelievably high, and consequently, no sample survives the mapping rate cutoff in curate. I'll fix this, but did you really test it when you implemented this feature? The problem was introduced by this commit https://github.com/kfuku52/amalgkit/commit/8f3fb6ce8740001aa215acba5965703e5a0e7994

<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">

num_read_fastp | num_read_fastp_input | num_read_fastq_dumped | num_read_fastq_rejected | num_read_fastq_written | num_read_unfiltered -- | -- | -- | -- | -- | -- 2150736821 | 2158686520 | 3201205000 | 984404600 | 3170890400 | 3201205000

Hego-CCTB commented 3 years ago

I have to admit, that is definitely a lack of proper testing on my part. I only checked if there were values being dumped and if the columns where there, but not if they actually make sense.

kfuku52 commented 3 years ago

You seem to have confused the total length with the number of reads. A program only works as you write it, not as you think it should. This is why testing is so important. In fixing this bug, I noticed that the mapping rate can be obtained from kallisto's stdout/stderr, so numread* will no longer be used for mapping rate calc from the next version. I'll push it later.

Hego-CCTB commented 3 years ago

I completely agree with you. That was a lapse of judgment from my side. Kallisto can also produce a .json file, which should contain a mapping rate, in case you don't want to parse STDOUT.

kfuku52 commented 3 years ago

The JSON file would be stable for a longer time. I'll use it. Thanks.

kfuku52 commented 3 years ago

fixed in https://github.com/kfuku52/amalgkit/commit/0d46e6abcf092aff0de2fadcd25e5d9486510827

Hego-CCTB commented 3 years ago

Quick question: Do we still have a need for updated_metadata, or can we obsolete that now?

kfuku52 commented 3 years ago

I was thinking about the same thing. The number of reads in the fastq files should be logged as it is difficult to calculate later. We should probably keep them as a log file for getfastq in /getfastq/SRR000000.