AsierGonzalez closed 2 years ago
Comment of @yuukiiwa on Slack:
Thank you for the suggestion!! Yes, I do think the checking of the number of columns is feasible. I can add that.
Comment of @uniqueg on Slack:
Perhaps you could also check the types for the integer columns, the values of the strand column, and the reference sequence format by a regex as @Asier Gonzalez suggested. If there's a particular requirement on the width of the ranges or the values of the score column, these could be validated, too, I think.
Hi @AsierGonzalez, I have updated the Docker image with the validations you and @uniqueg suggested: https://hub.docker.com/r/apaeval/q2_validation/tags?page=1&ordering=last_updated Thank you!!
Hi @yuukiiwa, I confirm that the new validation works as expected. I have tried it with modified chromosome names (added 'chr' before the number) and with an extra column and it failed both times. I think it would be a good idea to document this somewhere in case you want to support non-human BED files with other chromosome names, which would require changes to the validation.
Thanks!
Also, as I suggested in #147, I think that using versioning would be beneficial here.
As per the discussion on Slack and the meeting today:
It seems that at the moment no validation of the input is performed at all. It seems that the initial version of the validation script checked whether the name of the "feature", that is, column 4 in the input BED file, appeared in the reference file (lines 53 and 70-71 in validation.py from Friday 4th June). This was too stringent, so @yuukiiwa removed it temporarily (last commit on Friday 4th June).
This is not an issue as such, but we suggest that at least the format of the input file be validated (e.g. number of columns, notation of chromosomes, ...). This is what we consider the minimum validation required.
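As a rough illustration of what such a minimum format check could look like, here is a sketch in Python. It is not the code in validation.py. The six-column layout, the chromosome pattern (human names without a 'chr' prefix), the accepted strand values, and the names `validate_bed_line`, `EXPECTED_COLUMNS`, `CHROM_PATTERN`, `INT_PATTERN` and `VALID_STRANDS` are all assumptions made for the example.

```python
import re

# Assumptions for illustration only (not taken from validation.py):
# six tab-separated columns and human chromosome names without "chr".
EXPECTED_COLUMNS = 6
CHROM_PATTERN = re.compile(r"(?:[1-9]|1[0-9]|2[0-2]|X|Y|MT)")
INT_PATTERN = re.compile(r"[0-9]+")
VALID_STRANDS = {"+", "-"}


def validate_bed_line(line, chrom_pattern=CHROM_PATTERN):
    """Return a list of error messages for one BED line (empty if valid)."""
    fields = line.rstrip("\n").split("\t")
    # Without the right number of columns the other checks make no sense.
    if len(fields) != EXPECTED_COLUMNS:
        return [f"expected {EXPECTED_COLUMNS} columns, found {len(fields)}"]

    chrom, start, end, _name, _score, strand = fields
    errors = []
    # The whole chromosome field must match the expected notation.
    if not chrom_pattern.fullmatch(chrom):
        errors.append(f"unexpected chromosome name: {chrom!r}")
    # Start and end must be non-negative integers, with start <= end.
    if not (INT_PATTERN.fullmatch(start) and INT_PATTERN.fullmatch(end)):
        errors.append("start and end must be non-negative integers")
    elif int(start) > int(end):
        errors.append("start must not exceed end")
    if strand not in VALID_STRANDS:
        errors.append(f"unexpected strand: {strand!r}")
    return errors
```

Passing a different `chrom_pattern`, for example `re.compile(r"chr.+")`, would be one way to accept non-human BED files with other chromosome names without touching the rest of the checks.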