Close out BGI Genome Projects

sr320 commented 7 years ago

[x] Indicate we would like the partial refund option
[ ] Obtain complete details on what was carried out, in a manner suitable for publication, as the reports are unclear / inaccurate.
[x] Obtain validated version of all raw data.

kubu4 commented 7 years ago

Turns out, the FASTQ file "versions" we've received from them are all identical - only the names differ. Verified by checking md5 checksums; they all match.
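For anyone repeating this check, a minimal Python sketch of the md5 comparison could look like the following. The function names (md5sum, group_by_checksum) are hypothetical and this is not the command that was actually run; it only illustrates grouping files by checksum so that renamed copies of the same data end up together.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_checksum(paths):
    """Map each MD5 digest to the sorted list of file names that share it."""
    groups = defaultdict(list)
    for path in paths:
        groups[md5sum(path)].append(Path(path).name)
    return {digest: sorted(names) for digest, names in groups.items()}
```

Any digest that maps to more than one file name marks copies that differ only in name.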

sr320 commented 7 years ago

So one raw file is corrupt on their end?

kubu4 commented 7 years ago

I finally figured it out! The two original "corrupt" files are actually incomplete downloads! That's why checksums don't match AND why they cannot be unzipped/FASTQC'd!
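A truncated download can be spotted without a reference checksum. The sketch below assumes the FASTQ files are gzip-compressed (the thread only says they cannot be unzipped) and uses a hypothetical helper name, gzip_is_complete. It reads the stream to the end, which fails if the file stops before the gzip end-of-stream marker.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1024 * 1024):
    """Return True if the gzip stream can be read through to its end marker.

    A truncated file raises EOFError part-way through; a corrupt or
    non-gzip file raises OSError or zlib.error. Note that a missing file
    also raises OSError and is therefore reported as False here.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass  # discard the data; we only care whether reading succeeds
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

A file that fails this test and also has a mismatched checksum is more likely an incomplete transfer than a bad file at the source.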

kubu4 commented 7 years ago

Knowing this, what should we do with these data sets? Here are a couple of things:

Should I replace the two incomplete FASTQ files in /owl/web/nightingales/O_lurida?
1. If yes, should I rename the files so they match the others?
2. If no, I'll explain filename differences (and history) in the readme file.

We now have multiple copies of the same data in multiple locations on Owl. Should we retain all these copies, even though they're all the same data?

sr320 commented 7 years ago

Replace the incomplete files, naming them like the others. I guess I do not really care about the other locations on Owl, only nightingales and the corresponding sheet that describes the files and libraries.

kubu4 commented 7 years ago

Here's the info Frank (BGI rep) provided me with (he attached a bunch of screenshots; sorry). I think most of it may just be taken from the project report PDF...

1). data filter: filter reads with adapter
                 filter reads with small size
                 filter reads with >2% of bases as N
                 filter reads with many low-quality (<30) bases
2). kmer: software: kmerfreq 1.0
          Genome Size = K-mer_num / Peak_depth
3). simulate_heterozygosis: software: readsim 2.2
                            parameters: false ratio: 0.001/0.0015/0.002
                                        repeat ratio: 0.15/0.3/0.5/0.7
                                        hybrid ratio: 0.005/0.01/0.015/0.02
4). assembly: software: SOAPdenovo_v2.01
5). GC_depth: in-house program
6). close gaps: software: krskgf 1.19 / GapCloser_v1.12
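As a worked illustration of the genome size formula in item 2, here is a small Python sketch. The function name is hypothetical, and the numbers in the usage lines are made up purely to show the arithmetic; they are not values from the BGI report.

```python
def estimate_genome_size(kmer_num, peak_depth):
    """Estimate genome size as total k-mer count divided by the peak k-mer depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return kmer_num / peak_depth


# Made-up example values, NOT from the O. lurida project:
# 60 billion k-mers with a depth peak at 30x.
size = estimate_genome_size(kmer_num=60_000_000_000, peak_depth=30)
print(f"{size:,.0f}")  # prints 2,000,000,000
```

The actual K-mer_num and Peak_depth would have to come from the kmerfreq output, which we would still need to get from BGI.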

We used the script below to generate the fa file.

SOAPdenovo_v2.01 pregraph -s lib.cfg -d 1 -K * -o out.prefix >pregraph.log

SOAPdenovo_v2.01 contig -D 1 -M 2 -g out.prefix >contig.log

SOAPdenovo_v2.01 map -s lib.cfg -g out.prefix >map.log

SOAPdenovo_v2.01 scaff -g out.prefix -F >scaff.log

sr320 commented 7 years ago

So what is the resulting fasta file of the genome?

I have become confused about their two data dumps to us in 2016.

kubu4 commented 7 years ago

The resulting FASTA file is Ostrea_lurida.scafSeq

sr320 commented 7 years ago

That is what I thought, but then what is scaffold.fa.fill?

kubu4 commented 7 years ago

That's the initial assembly using only the small insert libraries. You'll notice that there's a TON of missing sequence (i.e. NNNNNNN) in that FASTA file because it hadn't been combined with the large insert libraries (which hadn't been sequenced when they produced that initial assembly).
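To put a number on the missing sequence, one option is to compute the fraction of N bases in each FASTA file. This is a minimal sketch with a hypothetical function name; it assumes a plain-text (uncompressed) FASTA and counts N case-insensitively.

```python
def n_fraction(fasta_path):
    """Return the fraction of sequence characters that are N (either case)."""
    total = 0
    n_count = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip header lines
            seq = line.strip().upper()
            total += len(seq)
            n_count += seq.count("N")
    return n_count / total if total else 0.0
```

Running it on both scaffold.fa.fill and Ostrea_lurida.scafSeq would show how much of the gap sequence was filled once the large insert libraries were added.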

sr320 / LabDocs

Close out BGI Genome Projects #392