OpenEBench summary workflow: proper validation lacking

iRNA-COSI / APAeval

Community effort to evaluate computational methods for the detection and quantification of poly(A) sites and estimating their differential usage across RNA-seq samples

MIT License

OpenEBench summary workflow: proper validation lacking #144

Status: Closed (AsierGonzalez closed this issue 2 years ago)

AsierGonzalez commented 3 years ago

As per the discussion on Slack and the meeting today:

It seems that at the moment no validation of the input is performed at all. The initial version of the validation script apparently checked whether the name of the "feature", that is, column 4 in the input BED file, appeared in the reference file (lines 53 and 70-71 in validation.py from Friday 4th June). This was too stringent, so @yuukiiwa removed it temporarily (last commit on Friday 4th June).

This is not an issue as such, but we suggest that at least the format of the input file be validated (e.g. number of columns, notation of chromosomes, ...). This is what we consider the minimum validation required.

AsierGonzalez commented 3 years ago

Comment of @yuukiiwa on Slack:

Thank you for the suggestion!! Yes, I do think the checking of the number of columns is feasible. I can add that.

AsierGonzalez commented 3 years ago

Comment of @uniqueg on Slack:

Perhaps you could also check the types for the integer columns, the values of the strand column, and the reference sequence format by a regex as @Asier Gonzalez suggested. If there's a particular requirement on the width of the ranges or the values of the score column, these could be validated, too, I think.
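A minimal sketch of what the checks suggested above could look like. It is not the contents of the actual validation.py. It assumes a 6-column, tab-separated BED input (chrom, start, end, name, score, strand), human chromosome names without a "chr" prefix, and a strand of "+" or "-". The function name `validate_bed_line` and the exact regex are hypothetical.

```python
import re

# Assumption: human chromosomes without a "chr" prefix (1-22, X, Y, MT).
CHROM_RE = re.compile(r"([1-9]|1[0-9]|2[0-2]|X|Y|MT)")
# Assumption: chrom, start, end, name, score, strand.
EXPECTED_COLUMNS = 6
# Assumption: unstranded features (".") are not accepted.
VALID_STRANDS = {"+", "-"}
INT_RE = re.compile(r"[0-9]+")


def validate_bed_line(line):
    """Return a list of error messages for one BED line (empty if valid)."""
    fields = line.rstrip("\n").split("\t")
    # Stop early on a wrong column count: the other checks rely on it.
    if len(fields) != EXPECTED_COLUMNS:
        return [f"expected {EXPECTED_COLUMNS} columns, found {len(fields)}"]

    chrom, start, end, _name, score, strand = fields
    errors = []

    # Reference sequence format, checked with a regex.
    if not CHROM_RE.fullmatch(chrom):
        errors.append(f"unexpected chromosome name: {chrom!r}")

    # Types of the integer columns, then the width of the range.
    if not (INT_RE.fullmatch(start) and INT_RE.fullmatch(end)):
        errors.append("start and end must be non-negative integers")
    elif int(start) > int(end):
        errors.append("start must not be greater than end")

    # Score column: only checked for being numeric here.
    try:
        float(score)
    except ValueError:
        errors.append(f"score is not numeric: {score!r}")

    # Values of the strand column.
    if strand not in VALID_STRANDS:
        errors.append(f"invalid strand: {strand!r}")

    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in a line at once; whether the real script does this is not stated in the thread.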

yuukiiwa commented 3 years ago

Hi @AsierGonzalez, I have updated the Docker image with the validations you and @uniqueg suggested: https://hub.docker.com/r/apaeval/q2_validation/tags?page=1&ordering=last_updated

Thank you!!

AsierGonzalez commented 3 years ago

Hi @yuukiiwa, I confirm that the new validation works as expected. I have tried it with modified chromosome names (added 'chr' before the number) and with an extra column and it failed both times. I think it would be a good idea to document this somewhere in case you want to support non-human BED files with other chromosome names, which would require changes to the validation.

Thanks!
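One possible way to support other chromosome naming schemes, sketched under assumptions: the chromosome pattern is selected by a parameter instead of being hard-coded. The scheme names, the `CHROM_PATTERNS` table and `is_valid_chrom` are hypothetical and not part of the current validation.

```python
import re

# Hypothetical presets; the actual validation may use different patterns.
CHROM_PATTERNS = {
    # Names without a prefix, e.g. "1", "X", "MT".
    "ensembl_human": r"([1-9]|1[0-9]|2[0-2]|X|Y|MT)",
    # Names with a "chr" prefix, e.g. "chr1", "chrX", "chrM".
    "ucsc_human": r"chr([1-9]|1[0-9]|2[0-2]|X|Y|M)",
}


def is_valid_chrom(name, scheme="ensembl_human"):
    """Check a chromosome name against the pattern of the chosen scheme."""
    return re.fullmatch(CHROM_PATTERNS[scheme], name) is not None
```

With these presets, `is_valid_chrom("chr1")` is rejected under the default scheme, matching the behaviour confirmed above, while `is_valid_chrom("chr1", scheme="ucsc_human")` is accepted. A non-human organism would need its own entry in the table.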

AsierGonzalez commented 3 years ago

Also, as I suggested in #147, I think that using versioning would be beneficial here.