qiime2 / Keemei

Validate tabular bioinformatics file formats in Google Sheets
https://keemei.qiime2.org
BSD 3-Clause "New" or "Revised" License

Add option to check for R1 and R2 matches for paired end data #75

Closed shiffer1 closed 7 years ago

shiffer1 commented 7 years ago

Add a quick parse of sample names that makes sure both R1 and R2 data are there and match. This is a horrible thing to search for. Thanks, Arron

jairideout commented 7 years ago

Thanks @shiffer1! I'm not sure I understand what you're asking. Keemei validates QIIME mapping files (and a couple of other formats). The sample ID column doesn't contain sample IDs for R1 and R2; it only lists a single sample per row. And there isn't associated sequence data in the file format (it's only metadata). Typically you wouldn't want to include separate samples for R1 and R2 data, since those data logically represent a single sample.

Are you imagining a new file format that Keemei would validate? Or can you provide an example of how you'd put this information in a QIIME mapping file?

shiffer1 commented 7 years ago

Hey Jai, I was thinking something along the lines of it verifying that both R1 and R2 are there. An example would be 2017GWARhere_L001_R1.fastq.gz and 2017GWARhere_L001_R2.fastq.gz.

I was thinking that if there is a sequence file that has R2, the matching R1 could be verified. I take your point about them having different names other than the R1 - R2 difference though. So maybe it's not such a good idea then. Thanks, Arron


jairideout commented 7 years ago

Thanks for the details! I'm not sure where this would fit into the currently supported file formats since their format specifications don't understand file naming conventions. We could create a new file format specifically for this case, but it'd probably be more work than it's worth to have a user import all of their per-sample paired-end filenames into a Google Sheet in order to validate. It'd be easier to perform this type of validation on the command line with Unix commands, or write a small utility script that validates a directory of these files (I recommend the latter).
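
For anyone landing here later, a small utility script along those lines might look like the sketch below. It is not part of Keemei or QIIME. It assumes the naming convention from the example above (`<sample>_R1<suffix>` and `<sample>_R2<suffix>`, with a `.fastq`, `.fq`, `.fastq.gz` or `.fq.gz` extension), and the names `READ_PATTERN`, `find_missing_mates` and `check_directory` are made up for illustration. It only compares file names and does not open the files or check that the reads inside them match.

```python
import re
from pathlib import Path

# Assumed naming convention: <sample>_R1<suffix> / <sample>_R2<suffix>,
# e.g. 2017GWARhere_L001_R1.fastq.gz. Adjust the pattern for other layouts.
READ_PATTERN = re.compile(
    r"^(?P<prefix>.+)_R(?P<read>[12])(?P<suffix>(?:_\d+)?\.f(?:ast)?q(?:\.gz)?)$"
)


def find_missing_mates(filenames):
    """Return the sorted names of mate files that are expected but absent."""
    names = set(filenames)
    missing = set()
    for name in names:
        match = READ_PATTERN.match(name)
        if match is None:
            continue  # not a recognised R1/R2 file; ignored in this sketch
        other = "2" if match["read"] == "1" else "1"
        mate = f"{match['prefix']}_R{other}{match['suffix']}"
        if mate not in names:
            missing.add(mate)
    return sorted(missing)


def check_directory(directory):
    """Apply find_missing_mates to the files directly inside `directory`."""
    return find_missing_mates(
        p.name for p in Path(directory).iterdir() if p.is_file()
    )
```

Calling `check_directory("path/to/reads")` (a placeholder path) returns a list of the mate files that should exist but don't; an empty list means every R1 has its R2 and vice versa. Files that don't fit the assumed pattern are silently skipped, which may or may not be what you want.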