Public mean coverage data

adriafarres commented 1 year ago

Hello,

Is there any public dataset or website where one can download an already pre-computed mean_coverage.txt file? I have a very small dataset for which I'm trying to compute CNVs.

Thank you

tf2 commented 1 year ago

Sure - I don't think it would make sense to use one computed on a different dataset from your own - this file is only used to exclude positions from CNV calling (based on the variability of each position). CNest is not really designed to operate on very small datasets; having said that, it will probably work quite well. How many samples do you have?

There would be a way to skip this and just use all positions... or, if you have a fair number of samples, it will probably still work OK.
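To illustrate the idea of excluding positions based on their variability, here is a rough sketch. It is not CNest's actual code: the function name `variable_positions`, the `max_cv` threshold, and the use of the coefficient of variation as the variability measure are all assumptions made for the example.

```python
from statistics import mean, pstdev

def variable_positions(coverage, max_cv=0.3):
    """Return positions whose coverage varies too much across samples.

    coverage: dict mapping a position label to a list of per-sample
    coverage values. max_cv is a hypothetical cutoff on the coefficient
    of variation (standard deviation divided by mean).
    """
    excluded = set()
    for pos, values in coverage.items():
        m = mean(values)
        # Positions with no coverage, or with highly variable coverage,
        # are flagged for exclusion from calling.
        if m == 0 or pstdev(values) / m > max_cv:
            excluded.add(pos)
    return excluded

# Toy data: four samples, two positions (values are made up).
coverage = {
    "chr1:1000": [30.0, 31.0, 29.0, 30.5],  # stable across samples
    "chr1:2000": [30.0, 5.0, 60.0, 12.0],   # highly variable
}
excluded = variable_positions(coverage)
```

With only a handful of samples the spread at each position is poorly estimated, which is why a file computed on someone else's dataset would not transfer well.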

adriafarres commented 1 year ago

Thank you for your reply, Tomas.

As of right now I'm interested in finding CNVs for 3 genomes. It can't even be considered a dataset haha. One of these genomes was sequenced by Dante Labs. They provide a list of CNVs that are obtained with Dragen, Illumina's caller (the other ones didn't come with CNVs). After annotating those CNVs and filtering them by haploinsufficiency, I noticed that there are a bunch of them in genes that are highly haploinsufficient and in regions that are extremely conserved (according to gnomAD and the data from this study).

Furthermore, I annotated the others with CNVPytor (without mean coverage) and the CNVs are vastly distinct, which is highly suspicious considering those genomes belong to siblings. So at this point I don't know if the callers (or CNV callers in general) are very imprecise, if the lack of mean coverage really affects the output, or if Dragen's CNVs are actually correct. I was hoping I could get a second opinion on those CNVs by running CNest.

Maybe you can shed some light on this.

Thank you.

tf2 commented 1 year ago

I'm afraid to say that with only 3 genomes CNest is not really appropriate to use - it needs to estimate a baseline and certain other noise characteristics - and 3 samples is just not going to be enough. Another complication is that these genomes seem to be related, right? Siblings? This is going to be tricky because CNest (and, I believe, most CNV callers) uses the other samples in the set to create a baseline - if there are only related samples in the set, it is very likely that real CNV events will be normalised out, i.e. a deletion seen in most of the samples will look like normal copy number (2 copies), etc.

Is there any way for you to obtain a set of e.g. 5 unrelated genomes from the same sequencing platform to help create a baseline?
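A toy example of the "normalised out" effect, as a sketch only: it assumes the baseline is simply the median coverage across the reference samples and that copy number is 2 times the ratio to that baseline, which is a simplification and not how CNest actually computes things. The function name and all coverage values are made up.

```python
from statistics import median

def copy_number_estimate(sample_cov, reference_covs):
    """Estimate copy number at one position from a coverage ratio.

    Hypothetical simplification: baseline = median of the reference
    samples, and a ratio of 1.0 corresponds to 2 copies.
    """
    baseline = median(reference_covs)
    return round(2 * sample_cov / baseline, 2)

# One sibling carries a heterozygous deletion, so coverage is about
# half of the expected ~30x at this position.
sibling_cov = 15.0

# Baseline built from the three siblings, who all share the deletion.
related = copy_number_estimate(sibling_cov, [15.0, 15.0, 16.0])

# Baseline built from five unrelated genomes without the deletion.
unrelated = copy_number_estimate(sibling_cov, [30.0, 31.0, 29.0, 30.0, 32.0])
```

Against the sibling-only baseline the deletion looks like 2 copies, while against the unrelated baseline it shows up as roughly 1 copy.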

adriafarres commented 1 year ago

Thank you again for your reply. I will try to do that. Can the results vary substantially when using different sequencing machines? I apologize if these questions are very basic. I have never worked with CNVs.

tf2 / CNest

Public mean coverage data #15