wb2osz / direwolf

Dire Wolf is a software "soundcard" AX.25 packet modem/TNC and APRS encoder/decoder. It can be used stand-alone to observe APRS traffic, as a tracker, digipeater, APRStt gateway, or Internet Gateway (IGate). For more information, look at the bottom 1/4 of this page and in https://github.com/wb2osz/direwolf/blob/dev/doc/README.md
GNU General Public License v2.0
1.55k stars, 302 forks

^M carriage return in logs, dti row #317

Open nayrnet opened 3 years ago

nayrnet commented 3 years ago

I'm trying to import this data into a database for viewing and keep running into this issue, having to fix these lines by hand. Both characters of the ^M are seen as one char in a text editor; I can backspace over it and type a literal ^M in its place, and it's fine afterward.

cat direwolf.log | iconv -f utf-8 -t utf-8 -c | dos2unix -f | sed s/^M// | psql -h localhost -p 5432 -U testuser direwolf -c "COPY logs (chan,utime,isotime,source,heard,level,error,dti,name,symbol,latitude,longitude,speed,course,altitude,frequency,coffset,tone,system,status,telemetry,comment) FROM STDIN DELIMITER ',' CSV HEADER;"
ERROR:  unquoted carriage return found in data
HINT:  Use quoted CSV field to represent carriage return.
CONTEXT:  COPY logs, line 451057
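One possible workaround for a bulk import like this is to filter the log before it reaches psql. The following is only a sketch in Python and is not part of Dire Wolf: the names `strip_embedded_cr` and `clean_stream` are hypothetical, and it assumes every record ends in LF or CRLF, so any other carriage return is stray packet data inside a field.

```python
import io


def strip_embedded_cr(line, replacement=" "):
    """Return one log record with stray carriage returns replaced.

    A trailing LF or CRLF is treated as the record terminator and
    normalized to LF; every other CR is assumed to be field data.
    """
    body = line.rstrip("\n")
    if body.endswith("\r"):
        body = body[:-1]  # drop the CR that belongs to a CRLF terminator
    return body.replace("\r", replacement) + "\n"


def clean_stream(src, dst, replacement=" "):
    """Copy records from src to dst, replacing embedded CRs in each one."""
    for record in src:
        dst.write(strip_embedded_cr(record, replacement))


# Small in-memory demonstration with a made-up two-record log.
demo_in = io.StringIO("a,b\rc\r\nd,e\n", newline="\n")
demo_out = io.StringIO()
clean_stream(demo_in, demo_out)
```

When reading a real file, opening it with `open(path, newline="\n")` keeps Python from treating a lone CR as a line break, which this sketch relies on. Replacing the CR loses the original byte, so it only suits cases where an exact copy of the received data is not required.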

[Screenshot: Screen Shot 2021-02-23 at 6 33 13 PM]

nayrnet commented 3 years ago

Also getting carriage returns elsewhere in the logs. It is a real struggle to do anything with a CSV that has not been properly sanitized before writing.

[Screenshot: Screen Shot 2021-02-23 at 6 54 29 PM]

dranch commented 3 years ago

Your first packet that has the ^M in it is not valid (see the destination callsign with a "?" in it). Are you using Direwolf's FIX_BITS feature? If so, turn it off. Even without FIX_BITS, there is a non-zero chance that a packet will pass the CRC check yet be corrupted. Maybe Direwolf could improve its output sanitization, but it will never be perfect. I would recommend having your application sanitize its input as well. Next up, if you read page 20 on APRS comments at http://www.aprs.org/doc/APRS101.PDF , it seems like carriage returns ARE allowed (not explicitly forbidden).

nayrnet commented 3 years ago

I am sanitizing it in my app that watches the log and drops 'em into the DB in real time, but I'm trying to bulk import over 8 million rows into the database, and that is a huge overhead on that end.

While a carriage return in a string is allowed in APRS, it is not allowed unquoted in the CSV spec. If these were raw logs that'd be one thing, but you're outputting a standardized format, so it should be compliant; otherwise this bespoke Direwolf CSV format you invented is not really useful if just about any tool that reads it will error out. https://tools.ietf.org/html/rfc4180
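To locate the offending records ahead of an import, a small scanner can report where a carriage return sits outside a quoted field. This is a rough sketch, not a full RFC 4180 validator: `find_unquoted_cr` is a hypothetical name, it counts physical lines by LF, and it treats a CR immediately followed by LF as a normal record terminator.

```python
def find_unquoted_cr(text):
    """Return 1-based line numbers (counted by LF) holding a bare CR
    that is outside double quotes and not part of a CRLF pair."""
    hits = []
    in_quotes = False
    line_no = 1
    for i, ch in enumerate(text):
        if ch == '"':
            # An escaped quote ("") toggles twice, so the state is unchanged.
            in_quotes = not in_quotes
        elif ch == "\n":
            line_no += 1
        elif ch == "\r" and not in_quotes:
            if text[i + 1:i + 2] != "\n":
                hits.append(line_no)
    return hits


# Made-up sample: line 2 has a quoted CR (fine), line 3 has a bare CR.
sample = 'a,b\r\nc,"d\re"\r\nf,g\rh\r\n'
bad_lines = find_unquoted_cr(sample)
```

The reported numbers are physical line numbers in the text that was scanned, so they may not match the line count an importer prints if it counts records differently.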

I can accept losing a random packet that can't be parsed once in a blue moon, but as a station operator I'm responsible for my station, and I believe having complete and accurate logs is important. I would much rather be assured that the logs I'm scraping won't keep coming up with new ways to deviate from standards over time.

dranch commented 3 years ago

Interesting that CSV doesn't allow carriage returns. I suppose if Direwolf is to strictly follow the RFC, this would need to be changed. Not sure what WB2OSZ would want to do here since, to be technically accurate, the ^M should stay as that's what was potentially received.

nayrnet commented 3 years ago

I'm fine with them being escaped; it just needs quotes put around the field and Postgres would take it as it is. I would rather have more accurate data, which can then be displayed escaped or as intended in the final output, but for data handling the log file should conform to CSV standards, not the incoming protocol's standards.
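As an illustration of the quoting idea, the sketch below uses Python's standard `csv` module to write a field containing a carriage return inside quotes and read it back unchanged. It is not how Dire Wolf writes its log; the callsign and comment are made-up values, and `csv.QUOTE_ALL` is chosen here only so that every field is certain to be quoted.

```python
import csv
import io

# newline="" stops the text layer from translating CR or LF characters.
buf = io.StringIO(newline="")
writer = csv.writer(buf, quoting=csv.QUOTE_ALL)

# Hypothetical record: a source callsign and a comment with an embedded CR.
original = ["N0CALL", "hello\rworld"]
writer.writerow(original)

encoded = buf.getvalue()

# Read the record back; the CR survives because it sits inside quotes.
buf.seek(0)
decoded = list(csv.reader(buf))
```

The received byte is preserved rather than stripped, which fits the wish to keep the data accurate while still producing a file a CSV parser accepts.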

I would also love to have a raw log file of unprocessed packets, faults and all, in addition to the CSV. With the existing CSV file we are already losing information in the conversion from a packet into a series of objects as Direwolf processes it, so I see no loss at all in linting the CSV output so it's valid, encoded properly, and consumable by anything that can parse CSV files.

This would also make it trivial to transform the data into other formats as needed, such as JSON/YAML/XML/etc., but if it won't even pass its own standard there is a rough road ahead for anyone wanting to transform it into a new format.
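Once the file parses as CSV, a conversion to another format can be short. The sketch below turns each record into one JSON line with Python's standard library; `csv_to_json_lines` is a hypothetical helper, and the three-column header is a shortened, made-up subset of the log's columns.

```python
import csv
import io
import json


def csv_to_json_lines(src):
    """Return one JSON string per CSV record, keyed by the header row."""
    reader = csv.DictReader(src)
    return [json.dumps(row) for row in reader]


# Made-up sample with a quoted comment that contains a carriage return.
sample = io.StringIO(
    'chan,source,comment\r\n0,N0CALL,"hi\rthere"\r\n',
    newline="",
)
json_lines = csv_to_json_lines(sample)
```

JSON escapes the carriage return as `\r` inside the string, so the original value can be recovered exactly by any JSON parser.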