lilit-nersisyan / computel

Compute mean telomere length from Whole Genome Sequencing data.
http://big.sci.am/software/computel/
13 stars 11 forks source link

estimate base coverage #2

Closed danshu closed 8 years ago

danshu commented 8 years ago

Hi,

I set the following parameters in the configuration file and I expected computel to estimate base coverage using the base.cov formula in the paper. "###base_coverage_calculation_options compute.base.cov F

base.cov base.index.pathtoprefix /tools/analysis/repeat/computel-0.2/src/examples/base.index/base_index

estimate.base.cov T genome.length 300000000"

However, I found that the mean telomere length calculated is extremely large and "base.cov=1" in the file "tel.length.xls", showing that there is something wrong with base.cov calculation.

Best, Quan

lilit-nersisyan commented 8 years ago

Hi Quan,

Yes, if the base.cov is 1 in the output it means that it was not estimated properly. Have you saved the log while running Computel? I could look at the input parameters.

Also, there should be a file named "length" generated during coverage estimation: let me know if you have that, and, if yes, what's written in there.

I notice that there are spaces between the parameters and their values (should be tabs): don't know if this is the case in your config file or it's just the copy-paste result: can you check this?

Thanks, Lilit

danshu commented 8 years ago

Hi Lilit,

The spaces between the parameters and their values are just the copy-paste results. I don't have files named "length" generated. I found the following warning message in the log. "Warning: base coverage is estimated only from gzip or bzip2 compressed files. Default base coverage 1 assigned. property " base.cov " not found. Default value assigned: 1 "

Does this means that I can not estimate base coverage when the input files are not compressed?

Thanks, Quan

lilit-nersisyan commented 8 years ago

Dear Quan,

Ok, that's the reason: thanks for pointing this out.

The estimate base coverage was initially designed for compressed files. The reason is that if you have the uncompressed file you can simply count the number of reads (the number of lines in the file divided by 4), and then estimate the base coverage with the formula numofreads*readlength/genomelength.

I will implement this for uncompressed files (so that users not do that by hand) and get back to you soon.

Cheers, Lilit

danshu commented 8 years ago

Dear Lilit,

Thanks for your help!

Best, Quan

lilit-nersisyan commented 7 years ago

Dear Quan,

You may want to checkout Computel v0.3: it's much simpler to use (it uses a shell script with simple basic usage), and it estimates the base coverage from fastq files by default. Your feedback would be valuable.

Cheers, Lilit

danshu commented 7 years ago

Dear Lilit,

Thanks! I will try it!

Best, Quan

danshu commented 7 years ago

@lilit-nersisyan

I have tried Computel v0.3. It is great and quite easy to use! One thing I want to know is that how Computel calculate read.length. My reads were trimmed for adapters and thus they have varying read length up to 125 bp. The read.length in the output is 124 instead. So is Computel aware of varying read length?

danshu commented 7 years ago

And I noticed that using the new version seems get smaller estimate than using previous version. I'm not sure about this.

lilit-nersisyan commented 7 years ago

Dear Quan,

Thank you for the feedback. Trimmed files is something we haven't thought of when designing Computel: it assumes the reads all have the same length, as with high quality Illumina reads. Thus, Computel just takes the first read in the first fastq file and estimates its length. Thus, I assume the read length of the first read in your case was 124. What read length have you supplied to Computel before? Probably, the difference in read lengths is the reason for different estimates.

Meanwhile, if the difference between read lengths is not that large (I hope it isn't), I suggest modifying the fastq files to put a read with an average length in the first line. This should be a valid turnaround: however we haven't done any tests for this case to be sure about this...

I'll think of modifications to the Computel workflow to account for differing read lengths, but this will take time.

Best, Lilit

danshu commented 7 years ago

Dear Lilit,

Thanks for your explanation. In my case I think the difference between read length 124 and 125 will not influence too much of the results. It's also easy to modifying the fastq files to put a read with an average length in the first line.

In case that the user is not aware of how Computel computes read length and the read length of first read varies much from the average read length, computel may give wrong results. So I think it's better to let the user to set the read length or just use the mean read length.

Best, Quan

lilit-nersisyan commented 7 years ago

Dear Quan,

Thank you for the suggestions. I'll try to modify Computel so that the varying read lengths do not affect the result. However, the single nucleotide difference should not make much difference in your case. What difference do you observe between estimates by Computel v0.2 and v0.3?

Best, Lilit

On Mon, Nov 21, 2016 at 12:06 PM, danshu notifications@github.com wrote:

Dear Lilit,

Thanks for your explanation. In my case I think the difference between read length 124 and 125 will not influence too much of the results. It's also easy to modifying the fastq files to put a read with an average length in the first line.

In case that the user is not aware of how Computel computes read length and the read length of first read varies much from the average read length, computel may give wrong results. So I think it's better to let the user to set the read length or just use the mean read length.

Best, Quan

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lilit-nersisyan/computel/issues/2#issuecomment-261869516, or mute the thread https://github.com/notifications/unsubscribe-auth/AHPMcS8gixjMJ79HVw5gDdxsS5PkfvU9ks5rAVEQgaJpZM4Kadhi .

Lilit Nersisyan Junior Researcher, MSc Institute of Molecular Biology NAS RA 7, Hasratyan str., 0014, Yerevan, Armenia Tel.: +37494 601703, +37410 282622

danshu commented 7 years ago

Dear Lilit,

It is possibly due to the difference in seed length. Sorry for that.

Best, Quan