Closed 1rzhu closed 1 year ago
Thanks @IsleZhu for the work. I didn't realize that the `csv` module is perhaps not flexible enough (in terms of ignoring commented-out lines, providing alternate headers, etc.). Plus, this PR takes certain approaches that I think are anti-patterns. Take a look at PR #62, which is how I would implement this. It might be an instructive example of how to write one's own parsing routine. Let's also try to write tests for any new code we write (better late than never).
If you're happy with PR #62 after reviewing it, let's close this here and merge that instead. Any suggestions welcome of course.
Thank you for the improvements. They are very helpful for learning. And yes, I will also take care of the tests.
Thanks. I think I took some shortcuts in the original implementation that might as well be fixed while we're doing this.
`#` to indicate a comment. All such lines should be ignored (the `csv` module may have a flag for this), except the first one, which should be interpreted as the header.
`#` stripped off. The resulting string should be lowercased and tokenized on a `\t` (so the column names found would be `chromosome` or `ucsc` instead of `#Chromosome` or `# ucsc`).
`chr` prefix for the common name should be stripped off in both cases (I'm not sure which formats use it and which don't, or whether it's documented, but we should just take care of it in the script regardless).
Again, it might well be that the `csv` module can handle all these cases.
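For reference, here is a minimal sketch of the comment and header handling described above, written in plain Python without the `csv` module. It is not the code from PR #62. The names `parse_table` and `strip_chr`, the sample data, and the choice of which column holds the common name are all assumptions made for illustration.

```python
import io


def parse_table(handle):
    """Yield one dict per data row of a tab-separated table.

    Assumption: the first line starting with '#' is the header; every
    later '#' line is a comment and is skipped.
    """
    header = None
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            if header is None:
                # Strip the leading '#', lowercase, and tokenize on tabs,
                # so '#Chromosome' -> 'chromosome' and '# ucsc' -> 'ucsc'.
                cleaned = line.lstrip("#").strip().lower()
                header = [name.strip() for name in cleaned.split("\t")]
            continue  # all other '#' lines are ignored
        if header is None:
            raise ValueError("data row found before a '#' header line")
        yield dict(zip(header, line.split("\t")))


def strip_chr(name):
    """Drop a leading 'chr' prefix if present (hypothetical helper)."""
    return name[3:] if name.startswith("chr") else name


# Hypothetical sample input: a header line, one comment, two data rows.
sample = "#Chromosome\tUCSC\n# just a comment\n1\tchr1\nchrX\tchrX\n"
rows = list(parse_table(io.StringIO(sample)))
common_names = [strip_chr(row["chromosome"]) for row in rows]
```

Applying `strip_chr` unconditionally means the caller does not need to know which input format uses the prefix, which matches the "take care of it regardless" approach above.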