nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License
868 stars 695 forks source link

More flexible samplesheet #508

Open grst opened 3 years ago

grst commented 3 years ago

I think it would be great if the samplesheet could handle additional columns (which would be ignored by the RNA-seq pipeline, but would be used by some downstream analysis). Keeping these in a separate file leaves potential for human errors.

It would also be cool if the columns could be in arbitrary order.

To this end, it could maybe help to either use the stdlib csv parser or pandas read_csv instead of manually splitting the lines.

Potentially, these libraries would also be immune to "hidden carriage returns" (what came up on slack recently) or other caveats like quoted fields etc.

drpatelh commented 3 years ago

The pipeline should be able to handle additional columns already @grst.

Not convinced arbitrary order is a good thing though mainly for standardisation but happy to be persuaded otherwise.

Yes, we definitely need a more mainstream solution to read in the CSV files. I don't use pandas and other tools very much but PRs welcome ;) I was scrimping on having additional dependencies which is why I also went for a native solution but that shouldn't be a valid reason!

grst commented 3 years ago

The pipeline should be able to handle additional columns already @grst.

Cool! I could have sworn there was an issue the last time I tried, but probably I didn't add the additional columns at the end. I'll try it again next time!

Yes, we definitely need a more mainstream solution to read in the CSV files. I don't use pandas and other tools very much but PRs welcome ;)

I can definitely put something together when I find some time. I understand the concern with the dependencies. csv is in the standard library, so it wouldn't need any dependencies either. I have never used it in favor of pandas, though. It's probably slower than pandas, but that shouldn't be a concern for samplesheets.

grst commented 3 years ago

Some more thoughts:

drpatelh commented 3 years ago

Anything fancy we can do here @grst to improve our current implementation? We have now added a stand-alone samplesheet schema to help with the validation in https://github.com/nf-core/rnaseq/pull/623 but there are still some fundamental issues that may require outside scripts for this - some discussion in https://github.com/nf-core/rnaseq/pull/633

grst commented 3 years ago

I'll take a look at the issues you referenced... First step would be a reproducible example for the utf8-bug, though :thinking:

drpatelh commented 3 years ago

Yeah, I know. That 🐛 is really annoying and I suspect it will be quite an easy fix. Just haven't found a portable way to test it!

grst commented 3 years ago

The json-scheme and groovy-based validation look neat! That would make the python script obsolete anyway?

You mentioned the stripping of quotes is done in python, but there should be some way of achieving this in Groovy?

header = [x.strip('"') for x in fin.readline().strip().split(",")] 
drpatelh commented 3 years ago

Yup, most of the Python stuff would be obsolete!

I think the problem is that the validation needs to happen with standard libraries before it is passed to the pipeline in order to make it more portable but it doesn't have this option. Maybe @ewels can confirm.

grst commented 3 years ago

Additional java/groovy jars could be placed in lib as part of the pipeline template...

Midnighter commented 2 years ago

@grst maybe you could take a look at https://github.com/nf-core/tools/pull/1282 and see if that would fit your needs? I agree that doing the job in Groovy is preferable and consider my PR only a temporary change if it is at all wanted.

grst commented 2 years ago

Looks neat! I lost track of the discussion around the "samplesheet schema". Is your implemenation already part of that or would the schema validation be the next step?

Midnighter commented 2 years ago

@ewels showed my his work on the samplesheet schema on Friday but this is not it, I'm afraid.