copiousfreetime closed this issue 4 years ago
Thanks a lot for the detailed troubleshooting!
I think we definitely need some cleanup here. I don't think a newline character is to be expected in the keywords although it's understandable how and why it can be there (this is a keyword suggested by a user, so user input).
I'll make sure we do some cleanup and maybe just transform newline characters into spaces for the next version.
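For illustration only, the kind of cleanup described here might look like the sketch below. The function name `normalize_keyword` is hypothetical and this is not the project's actual code; it only assumes that each run of carriage returns or line feeds in a user-supplied keyword should become a single space.

```python
import re


def normalize_keyword(keyword: str) -> str:
    """Replace each run of CR/LF characters with one space (hypothetical helper)."""
    # "\r\n", "\n" and "\r" are all treated as line breaks; a run of them
    # collapses to a single space, and leading/trailing spaces are trimmed.
    return re.sub(r"[\r\n]+", " ", keyword).strip()


if __name__ == "__main__":
    print(repr(normalize_keyword("golden\nhour")))
```

Applying something like this at input time would keep every record on one physical line, so line-based tools and quote-aware parsers would agree on the record count.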
The keywords data file appears to have an embedded newline in one of the records. I just want to clarify if this is expected or not. It looks like the given psql loading instructions do account for newlines in the TSV file, but if folks are processing the file outside of that without using quote-escaping rules they may process the data incorrectly.
To Reproduce
Load the data according to the documented instructions:
Check the db row count
Expected behavior
I initially expected there to be 1 record for each non-header line of the TSV, but this appears to be an incorrect assumption. It looks like the psql command line parsed the TSV according to quoted-escape rules, so that is good.
I wrote a program to check the keywords file and it reports
Then looking at the lines around line 1590610 we see:
And the db reports that row and the preceding and following rows correctly loaded.
If folks are processing these TSV files simplistically, without quote-escaping logic, they may process the files incorrectly. I don't want folks to encounter that. This may also point to an upstream data input issue: if users are entering newlines in the keyword input, how are they getting processed in the main app?
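To make the difference concrete, here is a minimal sketch that compares a naive line count with a quote-aware parse using Python's standard `csv` module. The sample data and the column names (`photo_id`, `keyword`) are made up for the example, and the sketch assumes the embedded newline sits inside a double-quoted field, as the psql behavior suggests.

```python
import csv
import io

# Hypothetical sample: the second record has a newline inside a quoted field.
SAMPLE = 'photo_id\tkeyword\n1\tsunset\n2\t"golden\nhour"\n3\tbeach\n'


def naive_record_count(text: str) -> int:
    """Count non-empty physical lines, minus the header line."""
    return sum(1 for line in text.split("\n") if line) - 1


def parse_records(text: str) -> list[list[str]]:
    """Parse tab-separated text with quote handling, skipping the header."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar='"')
    next(reader)  # skip the header row
    return list(reader)


def records_with_newlines(records: list[list[str]]) -> list[list[str]]:
    """Return the records in which any field contains a line break."""
    return [r for r in records if any("\n" in f or "\r" in f for f in r)]


if __name__ == "__main__":
    records = parse_records(SAMPLE)
    print("naive line count:", naive_record_count(SAMPLE))
    print("parsed records:  ", len(records))
    print("with newlines:   ", records_with_newlines(records))
```

On this sample the naive count is one higher than the parsed record count, which is the same kind of mismatch as between the file's line count and the db row count.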
We may just want to document that there can be embedded newlines in the TSV files.
Thanks!