waldronlab / curatedMetagenomicDataCuration

Sample Metadata Curation for curatedMetagenomicData
https://waldronlab.io/curatedMetagenomicDataCuration/

Why are some samples in CosteaPI_2017 single-end only? #63

Closed luzhang321 closed 2 years ago

luzhang321 commented 2 years ago

Hi :)

I found a difference between the ENA file and the sampleMetadata file for CosteaPI_2017.

Here is an example for sample 713A002-11-0-0:

grep 713A002-11-0-0 CosteaPI_2017_KAZ* |less -S |cut -f 1-4,21
CosteaPI_2017 SID713A002-11-0-0 KAZ1 stool ERR1728788;ERR1728787;ERR1728786;ERR1728785

Only 4 ERRs are recorded here, and all of them are single-end.

However, when I look into the ENA file, 8 runs are recorded for this sample. Why are the paired-end runs excluded? The paper states that all samples are paired-end: "Library generation and shotgun sequencing were carried out on the Illumina HiSeq 2000/2500 (Illumina, San Diego, CA, USA) platform. All samples were paired-end-sequenced with 100 bp read."

grep 713A002-11-0-0 filereport_read_run_PRJEB17632_tsv.txt|cut -f 1-6,14,49
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1797591 ERR1727663 PAIRED 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1797592 ERR1727664 PAIRED 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1797593 ERR1727665 PAIRED 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1797594 ERR1727666 PAIRED 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1798713 ERR1728785 SINGLE 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1798714 ERR1728786 SINGLE 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1798715 ERR1728787 SINGLE 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1798716 ERR1728788 SINGLE 713A002-11-0-0

Why are the 4 paired-end ERRs excluded?
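
For anyone who wants to reproduce the comparison without eyeballing the grep output, here is a minimal Python sketch. It assumes the eight whitespace-separated columns printed above, with the run accession in column 6 and the library layout in column 7; that reading of the columns is my assumption based on the output, and the helper name `runs_by_layout` is made up for illustration.

```python
from collections import defaultdict

# Rows copied from the `cut -f 1-6,14,49` output quoted in this comment.
ENA_ROWS = """\
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1797591 ERR1727663 PAIRED 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1797592 ERR1727664 PAIRED 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1797593 ERR1727665 PAIRED 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1797594 ERR1727666 PAIRED 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1798713 ERR1728785 SINGLE 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1798714 ERR1728786 SINGLE 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1798715 ERR1728787 SINGLE 713A002-11-0-0
PRJEB17632 ERP019502 SAMEA4545280 ERS1444459 ERX1798716 ERR1728788 SINGLE 713A002-11-0-0
"""

# Run list recorded for this sample in the curated metadata (column 21 above).
CURATED_RUNS = "ERR1728788;ERR1728787;ERR1728786;ERR1728785"


def runs_by_layout(text):
    """Group run accessions by library layout (hypothetical helper)."""
    grouped = defaultdict(set)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip blank or malformed lines
        run, layout = fields[5], fields[6]  # assumed column positions
        grouped[layout].add(run)
    return dict(grouped)


by_layout = runs_by_layout(ENA_ROWS)
curated = set(CURATED_RUNS.split(";"))

# Paired-end runs present in the ENA report but absent from the curated list.
missing_paired = sorted(by_layout["PAIRED"] - curated)
print("missing paired-end runs:", missing_paired)
```

For this sample the curated list matches the SINGLE runs exactly, and all four PAIRED runs are missing.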

hjruscheweyh commented 2 years ago

Dear @luzhang321

Have you solved this issue? I'm also trying to work with this dataset and have run into the same problem of missing runs. There are 1122 paired-end runs in the BioProject, but only ~400 are recorded in CosteaPI_2017, so it seems that run files are missing.

The reason there are also single-end files is that the reads were processed with MOCAT, which takes paired-end reads and performs QC on them. Some reads lose their partner during QC and become "singletons", recorded here as SINGLE.

Best, hans

lwaldron commented 2 years ago

@paolinomanghi can you comment? The relevant curation file is https://github.com/waldronlab/curatedMetagenomicDataCuration/blob/master/inst/curated/CosteaPI_2017/CosteaPI_2017_metadata.tsv

hjruscheweyh commented 2 years ago

Hi All

I have created an updated mapping file between samples and runs that uses all paired-end runs (but excludes the singleton files).
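
A mapping of this kind could be built roughly as follows. This is only a sketch, not the procedure actually used for the attached file: the function name `paired_run_mapping`, the optional `exclude` argument, and the semicolon-joined output (mirroring the curated metadata column) are all assumptions. The input triples are the runs from the ENA report rows quoted earlier in this thread.

```python
from collections import defaultdict

# (run accession, library layout, sample alias), taken from the ENA report
# rows quoted earlier in this thread for sample 713A002-11-0-0.
ENA_RUNS = [
    ("ERR1727663", "PAIRED", "713A002-11-0-0"),
    ("ERR1727664", "PAIRED", "713A002-11-0-0"),
    ("ERR1727665", "PAIRED", "713A002-11-0-0"),
    ("ERR1727666", "PAIRED", "713A002-11-0-0"),
    ("ERR1728785", "SINGLE", "713A002-11-0-0"),
    ("ERR1728786", "SINGLE", "713A002-11-0-0"),
    ("ERR1728787", "SINGLE", "713A002-11-0-0"),
    ("ERR1728788", "SINGLE", "713A002-11-0-0"),
]


def paired_run_mapping(runs, exclude=frozenset()):
    """Map each sample to its PAIRED runs, joined with ';' (hypothetical helper).

    `exclude` can hold accessions to drop even though they are tagged PAIRED.
    """
    mapping = defaultdict(list)
    for run, layout, sample in runs:
        if layout == "PAIRED" and run not in exclude:
            mapping[sample].append(run)
    return {sample: ";".join(sorted(accs)) for sample, accs in mapping.items()}


mapping = paired_run_mapping(ENA_RUNS)
for sample, run_list in mapping.items():
    print(f"{sample}\t{run_list}")
```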

Keep in mind that some runs are tagged as paired but become single-end when downloaded/extracted using the SRA Toolkit (no clue why). See below for the accessions:

ERR1727403
ERR1727404
ERR1727405
ERR1727406
ERR1727428
ERR1727550
ERR1727551
ERR1727552
ERR1727553
ERR1727642
ERR1727643
ERR1727644
ERR1727683
ERR1727740
ERR1727766
ERR1728342
ERR1728343
ERR1728344
ERR1728345
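
One way to spot such runs after extraction is to check which accessions actually ended up with both mate files. The sketch below assumes the extracted files are named `<run>_1.fastq` and `<run>_2.fastq` for mates and `<run>.fastq` for unsplit reads (optionally gzipped); that naming convention is an assumption about the local setup, and the file names in the example are placeholders, not real accessions or real results.

```python
import re
from collections import defaultdict

# Assumed naming: <run>_1.fastq / <run>_2.fastq for mates, <run>.fastq otherwise.
FASTQ_NAME = re.compile(r"^(?P<run>[A-Za-z0-9]+)(?:_(?P<mate>[12]))?\.fastq(?:\.gz)?$")


def classify_runs(filenames):
    """Label each run 'paired' if both mate files exist, else 'single'."""
    mates = defaultdict(set)
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # not a FASTQ file following the assumed naming
        mates[match.group("run")].add(match.group("mate"))  # None if unsplit
    return {
        run: "paired" if {"1", "2"} <= found else "single"
        for run, found in mates.items()
    }


# Placeholder file names for illustration only.
example_files = [
    "EXAMPLE0001_1.fastq",
    "EXAMPLE0001_2.fastq",
    "EXAMPLE0002.fastq",
    "notes.txt",
]
layouts = classify_runs(example_files)
print(layouts)
```

Runs labelled "single" here but tagged PAIRED in the ENA report would be the ones to list separately, as done above.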

My mapping file between samples and runs is below:

CosteaPI_2017.txt

lwaldron commented 2 years ago

Pinging @paolinomanghi - would be great to fix this before the Bioconductor 3.15 release.

paolinomanghi commented 2 years ago

Thanks @luzhang321 and @hjruscheweyh for your valuable work.