overcast07 / wayback-machine-spn-scripts

Bash scripts which interact with Internet Archive Wayback Machine's Save Page Now
MIT License

Last line in index.txt ends with an LF which gets carried over into success.log #24

Closed by deepspaceaxolotl 1 year ago

deepspaceaxolotl commented 1 year ago

While lexicographically sorting success.log files to compare them against the input files and check that everything had been saved, I noticed that one line in each file was missorted because the line before it ended with LF instead of CRLF. After some investigation, I found that the affected lines are the last lines of the index.txt files, and that for some reason they end in LF rather than CRLF.

Thanks for this amazing piece of software!

deepspaceaxolotl commented 1 year ago

Important clarification: I ran the program in WSL and opened the files using Notepad++ to see the line endings and to sort.

overcast07 commented 1 year ago

I haven't used this script in WSL for a while, but I think the character change might ultimately be caused by the input files you're using containing CRLF line breaks instead of LF. The script doesn't change CRLF to LF or vice versa, but most of the programs and shell built-ins called by the script do output text with LF at the end of the output.

At lines 251–252 in spn.sh, the list of URLs is processed using awk to remove duplicates before being written to index.txt. I think either awk or echo would cause the text to end with LF (as opposed to ending with CRLF or ending without a line break).
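The effect can be reproduced outside the script. The sketch below is an illustration only: it assumes the input uses CRLF line breaks with no line break after the last URL, it uses the common `awk '!seen[$0]++'` de-duplication idiom (which may differ from the exact command in spn.sh), and the example.com URLs and the `dedupe_demo` name are made up.

```shell
#!/usr/bin/env bash
# Hypothetical input: CRLF-terminated URLs, with no line break after the last one.
# awk splits records on LF, so each CR stays attached to its line, and awk
# appends its own LF to every record it prints, including the last one.
dedupe_demo() {
  printf 'https://example.com/a\r\nhttps://example.com/b\r\nhttps://example.com/a\r\nhttps://example.com/c' \
    | awk '!seen[$0]++'
}

# Dump the raw bytes so that CR (\r) and LF (\n) are visible.
dedupe_demo | od -c
```

Under these assumptions, the first two lines of the output keep their CRLF, while the last line ends with a bare LF, which matches the pattern described in the report.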

You could potentially resolve your workflow issue by using line break-agnostic software to sort the URLs, e.g. by pasting the list of URLs into a spreadsheet program so that you don't have to deal with line break characters in the first place.

The presence of the CRLF characters doesn't seem to affect the main operation of the script, so it might not be necessary to change anything. It would probably be possible to convert all CRLF to LF when the script starts, but I'm not sure if this would actually be desirable.
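If converting at startup were wanted, a minimal sketch might look like the following. The `normalize_line_endings` helper is hypothetical and not part of spn.sh; it strips one trailing CR from each line and leaves everything else untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of spn.sh: strip a trailing CR from each line.
# The CR is produced with printf so that the sed expression does not depend
# on sed understanding the \r escape itself.
normalize_line_endings() {
  cr=$(printf '\r')
  sed "s/${cr}\$//"
}

# Example: CRLF input comes out with LF-only line breaks.
printf 'https://example.com/a\r\nhttps://example.com/b\r\n' \
  | normalize_line_endings | od -c
```

The same filter could also be run by hand on an input file before passing it to the script, which would leave the script itself unchanged.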

deepspaceaxolotl commented 1 year ago

Thank you for the detailed response!

It's definitely a minor issue, just a bit of a peeve in some circumstances, when working with both Windows and Linux.

Wondering, is there a reason for the index.txt and other files to end in line breaks rather than no line breaks?

overcast07 commented 1 year ago

They are added by default and expected by most Bash built-ins and most Unix command-line programs, because POSIX defines a line as a sequence of characters terminated by a newline (LF), and a text file as a sequence of such lines. It wouldn't make sense to deliberately remove them.
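As a small illustration of why the trailing LF matters, the sketch below shows two common tools treating an unterminated last line differently. The function names and sample strings are made up for the example.

```shell
#!/usr/bin/env bash
# wc -l counts newline characters, so an unterminated last line is not counted.
count_lines() { wc -l | tr -d ' '; }

printf 'url1\nurl2\n' | count_lines   # prints 2: both lines end with LF
printf 'url1\nurl2'   | count_lines   # prints 1: the last line has no LF

# A plain `while read` loop also skips an unterminated final line, because
# read returns a non-zero status when it reaches end-of-file before an LF.
read_loop_count() (
  n=0
  while IFS= read -r line; do n=$((n + 1)); done
  echo "$n"
)
printf 'url1\nurl2' | read_loop_count # prints 1
```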

deepspaceaxolotl commented 1 year ago

I didn't know that! Thanks for explaining. I'm going to close the issue, but it will stay here in case someone runs into a similar problem when working across both Windows and Linux.