groupschoof / AHRD

High throughput protein function annotation with Human Readable Description (HRDs) and Gene Ontology (GO) Terms.
https://www.cropbio.uni-bonn.de/

additional regex debug messages #9

Closed by wmnwmn 8 years ago

wmnwmn commented 8 years ago

If possible, it would be helpful to provide additional debug messages for the regex parsing. It is very difficult to get a regex working when the only output is "matches/doesn't match".

For example, in my test there were no matches at first, only errors like this:

gi|351734478|ref|NP_001235791.1| uncharacterized protein LOC100527162 [Glycine max] does not match provided regular expression ^>(?<accession>\\S+)(?<description>.*)$

However, if I replace the \\S+ with the equivalent [^\\s]+, then it works:

fasta_header_regex: '^>(?<accession>[^\\s]+)(?<description>.*)$'

But even though the regex now appears to match, every protein except one comes out like this:

sp|Q84UP7|CSLF6_ORYSJ Unknown protein

But 'Unknown protein' is not a string that exists in the database, and I am not using any filtering:

blacklist: /home/wmn/test/empty.txt
filter: /home/wmn/test/empty.txt
token_blacklist: /home/wmn/test/empty.txt

That leaves me not knowing whether the regex is actually picking up the descriptions correctly.
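
One way to check this outside AHRD is to run the patterns directly through java.util.regex. The sketch below is only an illustration, and it rests on two assumptions: that the pattern reaches the regex engine with both backslashes intact (as the error message prints it), and that the header line starts with '>'. The class and method names are made up. Under those assumptions, \\S+ means a literal backslash followed by one or more 'S' characters, and [^\\s]+ means a run of characters that are neither a backslash nor the letter 's', so neither behaves like \S+.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapingCheck {
    // Header from the report, with the leading '>' restored (assumption).
    static final String HEADER =
        ">gi|351734478|ref|NP_001235791.1| uncharacterized protein LOC100527162 [Glycine max]";

    /** Returns {accession, description}, or null when the pattern does not match. */
    static String[] capture(String regex, String header) {
        Matcher m = Pattern.compile(regex).matcher(header);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("accession"), m.group("description") };
    }

    public static void main(String[] args) {
        // Java string literals add one more level of escaping:
        // "\\\\S" below is the regex text \\S, and "\\S" is the regex text \S.
        String[] regexes = {
            "^>(?<accession>\\\\S+)(?<description>.*)$",    // regex text: \\S+
            "^>(?<accession>[^\\\\s]+)(?<description>.*)$", // regex text: [^\\s]+
            "^>(?<accession>\\S+)(?<description>.*)$"       // regex text: \S+
        };
        for (String regex : regexes) {
            String[] groups = capture(regex, HEADER);
            String result = (groups == null)
                ? "no match"
                : "accession=[" + groups[0] + "] description=[" + groups[1] + "]";
            System.out.println(regex + " -> " + result);
        }
    }
}
```

With this header, the first pattern does not match at all, the second matches but puts the whole header into the accession group and leaves the description empty (the header contains no 's' and no backslash), and only the third splits at the first whitespace. Whether an empty description is what leads to 'Unknown protein' here is a guess, not something the output confirms.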

How about a "debug mode" that shows complete regex match information for each string? For example, what <accession> and <description> were in each case, plus any other information that is available. Debugging these regexes is rapidly turning into the most time-consuming part of my project, and I don't know how I can ever be sure that they are actually working right.
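
A rough sketch of what such a debug line could look like, written against plain java.util.regex rather than AHRD's own code. The class name, method name, and output format are hypothetical, and it assumes the pattern defines the named groups accession and description. Brackets around each captured value make stray leading or trailing whitespace visible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderRegexDebug {
    /** Builds one debug line per header: the captured groups, or NO MATCH. */
    static String describe(Pattern pattern, String header) {
        Matcher m = pattern.matcher(header);
        if (!m.matches()) {
            return "NO MATCH  " + header;
        }
        String accession = m.group("accession");
        String description = m.group("description");
        String line = "MATCH  accession=[" + accession + "] description=[" + description + "]";
        // An empty description is a valid match but is rarely what was intended.
        if (description == null || description.trim().isEmpty()) {
            line += "  WARNING: empty description";
        }
        return line;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("^>(?<accession>\\S+)(?<description>.*)$");
        String[] headers = {
            // Header with a leading '>' (assumed form of the FASTA line).
            ">gi|351734478|ref|NP_001235791.1| uncharacterized protein LOC100527162 [Glycine max]",
            // Same header exactly as printed in the error message, without '>'.
            "gi|351734478|ref|NP_001235791.1| uncharacterized protein LOC100527162 [Glycine max]"
        };
        for (String header : headers) {
            System.out.println(describe(p, header));
        }
    }
}
```

Printing the groups for every header, rather than only the failures, is the point: a pattern that matches but captures the wrong span is otherwise indistinguishable from one that works.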

wmnwmn commented 8 years ago

Update: I'll play around with the code. If I can come up with some useful debug messages, I will submit a patch.

wmnwmn commented 8 years ago

Patch submitted.