Closed sr320 closed 7 years ago
Turns out the FASTQ file "versions" we've received from them are all identical; only the names differ. Verified by checking the MD5 checksums, which all match.
So one raw file is corrupt on their end?
I finally figured it out! The two original "corrupt" files are actually incomplete downloads! That's why checksums don't match AND why they cannot be unzipped/FASTQC'd!
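For anyone who wants to repeat the check, here is a minimal sketch of the two tests described above: comparing MD5 checksums and confirming that a gzipped FASTQ decompresses all the way to the end. The function names are my own, and this is not necessarily how the original verification was run.

```python
import gzip
import hashlib
import zlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file decompresses through to its end marker.

    A truncated download typically raises EOFError partway through reading.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True
```

An incomplete download should fail `gzip_is_complete()` and produce an `md5sum()` value that differs from the checksum of the complete file.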
Knowing this, what should we do with these data sets? Here are a couple of things:
Replace the incomplete files, naming them like the others. I guess I do not really care about the other locations on Owl, only nightingales and the corresponding sheet that describes the files and libraries.
On Tue, Jan 3, 2017 at 8:35 AM kubu4 notifications@github.com wrote:
Knowing this, what should we do with these data sets? Here are a couple of things:
- Should I replace the two incomplete FASTQ files in /owl/web/nightingales/O_lurida?
- If yes, should I rename the files so they match the others?
- If no, I'll explain filename differences (and history) in the readme file.
- We now have multiple copies of the same data in multiple locations on Owl. Should we retain all these copies, even though they're all the same data?
Here's the info Frank (BGI rep) provided me with (he attached a bunch of screenshots; sorry). I think most of it may just be taken from the project report PDF...
1) data filter:
   filter reads with adapter
   filter reads with small size
   filter reads with >2% N bases
   filter reads with many low-quality (<30) bases
2) kmer: software: kmerfreq 1.0
   Genome Size = K-mer_num / Peak_depth
3) simulate_heterozygosis: software: readsim 2.2
   parameter: false ratio: 0.001/0.0015/0.002
   repeat ratio: 0.15/0.3/0.5/0.7
   hybrid ratio: 0.005/0.01/0.015/0.02
4) assembly: software: SOAPdenovo_v2.01
5) GC_depth: in-house program
6) close gaps: software: krskgf 1.19 / GapCloser_v1.12
SOAPdenovo_v2.01 pregraph -s lib.cfg -d 1 -K * -o out.prefix >pregraph.log
SOAPdenovo_v2.01 contig -D 1 -M 2 -g out.prefix >contig.log
SOAPdenovo_v2.01 map -s lib.cfg -g out.prefix >map.log
SOAPdenovo_v2.01 scaff -g out.prefix -F >scaff.log
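As a rough illustration of the k-mer step above (Genome Size = K-mer_num / Peak_depth), here is a small sketch. The histogram format, the `min_depth` cutoff for skipping low-depth error k-mers, and the toy numbers are all my assumptions; I don't know how kmerfreq itself derives K-mer_num or the peak.

```python
def estimate_genome_size(kmer_num, peak_depth):
    """Apply the formula from the report: K-mer_num / Peak_depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return kmer_num / peak_depth


def summarize_histogram(histogram, min_depth=5):
    """Return (kmer_num, peak_depth) from a {depth: distinct_kmer_count} dict.

    Assumption: kmer_num is the total number of k-mer occurrences, and the
    peak is the most common depth at or above min_depth (a hypothetical
    cutoff used to ignore the low-depth error peak).
    """
    kmer_num = sum(depth * count for depth, count in histogram.items())
    candidates = {d: c for d, c in histogram.items() if d >= min_depth}
    peak_depth = max(candidates, key=candidates.get)
    return kmer_num, peak_depth


# Toy histogram, made up for illustration only (not from the BGI report).
toy_histogram = {1: 500, 2: 100, 19: 200, 20: 400, 21: 200}
toy_kmer_num, toy_peak_depth = summarize_histogram(toy_histogram)
toy_genome_size = estimate_genome_size(toy_kmer_num, toy_peak_depth)
```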
So what is the resulting fasta file of the genome?
I have become confused by their two data dumps to us in 2016.
The resulting FASTA file is Ostrea_lurida.scafSeq
That is what I thought, but then what is scaffold.fa.fill?
That's the initial assembly using only the small insert libraries. You'll notice that there's a TON of missing sequence (i.e. NNNNNNN) in that FASTA file because it hadn't been combined with the large insert libraries (which hadn't been sequenced when they produced that initial assembly).
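If it helps to put a number on the missing sequence, something like the sketch below could be run on both scaffold.fa.fill and Ostrea_lurida.scafSeq. The function name is my own, and it assumes plain, uncompressed FASTA.

```python
def n_fraction(fasta_path):
    """Return the fraction of sequence characters that are N (case-insensitive)."""
    total_bases = 0
    n_bases = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Skip FASTA header lines.
                continue
            sequence = line.strip().upper()
            total_bases += len(sequence)
            n_bases += sequence.count("N")
    return n_bases / total_bases if total_bases else 0.0
```

A much higher value for the initial assembly than for the final one would be consistent with the explanation above.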