Alternative format for chr_acc

vineetbansal commented 1 year ago

Thanks. I think I took some shortcuts in the original implementation that might as well be fixed while we're doing this.

Lines may start with a # to indicate a comment. All such lines should be ignored (the csv module may have a flag for this), except the first one which should be interpreted as the header.
The header line should have all leading # stripped off. The resulting string should be lowercased and tokenized on a \t (so the column names found would be chromosome or ucsc instead of #Chromosome or # ucsc).
The chr prefix for the common name should be stripped off in both cases (I'm not sure which formats use it and which don't, or whether it's documented, but we should just take care of it in the script regardless).

Again, it might well be that the csv module can handle all these cases.
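
The three rules above can be sketched as a small hand-written parser. This is only an illustration under stated assumptions, not the code from the PR: the function names `parse_chr_acc` and `strip_chr_prefix`, the `name_column` parameter, and the sample rows are all hypothetical.

```python
import io


def strip_chr_prefix(name):
    """Drop a leading 'chr' from a common name, if present."""
    return name[3:] if name.startswith("chr") else name


def parse_chr_acc(lines, name_column="chromosome"):
    """Parse a tab-separated mapping file into a list of row dicts.

    Hypothetical helper: `name_column` is an assumed name for the column
    that holds the common chromosome name.
    """
    header = None
    rows = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            if header is None:
                # The first comment line is the header: strip every leading '#',
                # lowercase, split on tabs, and trim stray spaces.
                stripped = line.lstrip("#").lower()
                header = [col.strip() for col in stripped.split("\t")]
            continue  # every other comment line is ignored
        if header is None:
            raise ValueError("data row found before a '#'-prefixed header line")
        row = dict(zip(header, line.split("\t")))
        if name_column in row:
            row[name_column] = strip_chr_prefix(row[name_column])
        rows.append(row)
    return rows


# Made-up sample input, for illustration only.
sample = "#Chromosome\tAccession\n# ignored comment\nchr1\tACC_1\n2\tACC_2\n"
rows = parse_chr_acc(io.StringIO(sample))
```

With this made-up input, the header becomes `chromosome` and `accession`, the second comment line is skipped, and `chr1` is stored as `1`.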

vineetbansal commented 1 year ago

Thanks @IsleZhu for the work. I didn't realize that the csv module might not be flexible enough (in terms of ignoring commented-out lines, providing alternate headers, etc.). Also, this PR takes certain approaches that I think are anti-patterns. Take a look at PR #62, which is how I would implement this. It might be a good instructive example of how to write one's own parsing routine. Let's also try to write tests for any new code we write (better late than never).

If you're happy with PR #62 after reviewing it, let's close this one and merge that instead. Any suggestions are welcome, of course.

1rzhu commented 1 year ago

Thank you for the improvements. They are very helpful for learning. And yes, I will also take care of the tests.

pritykinlab / guidescanpy

Alternative format for chr_acc #59