robinst / linkify

Rust library to find links such as URLs and email addresses in plain text, handling surrounding punctuation correctly
https://robinst.github.io/linkify/
Apache License 2.0

URL discovery in CSV files where values are not wrapped in quotes #68

Closed: cicdguy closed this issue 6 months ago

cicdguy commented 9 months ago

This is a reference to the issue from https://github.com/lycheeverse/lychee/issues/1299, and it was suggested that I post here for feedback.


Hello,

I'm using lychee 0.13.0, which in turn is using v0.10.0 of linkify (see here) and running it against this file: https://github.com/pharmaverse/admiraldiscovery/blob/06e6e55b884ef91de9ae457606ed66defc9dba14/data-raw/admiral-lookup-book.csv

Like so:

lychee **/*.csv

And I get the following result:

⠚ 1/47 ETA 80s ░░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_analysis_ratio.html,Template | Failed
⠚ 2/47 ETA 39s ░░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_dt.html,Template | Failed: Ne
⠚ 3/47 ETA 25s █░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_qtc.html,Template | Failed: Network
⠚ 4/47 ETA 19s █░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr.html,Template | Failed: Networ
⠚ 5/47 ETA 15s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template | Failed: Network
⠚ 6/47 ETA 12s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtemfl.html,Template | Failed: Netwo
⠚ 7/47 ETA 10s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged_lookup.html,Template | Failed
⠚ 8/47 ETA 9s ███░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr_dir.html,Template | Failed: Net
⠚ 9/47 ETA 8s ███░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_map.html,Template | Failed: Network
⠚ 10/47 ETA 7s ████░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_rr.html,Template | Failed: Network
⠚ 11/47 ETA 6s ████░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_dt.html,Template | Failed: Ne
⠚ 12/47 ETA 6s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/restrict_derivation.html,Template | Failed: Netw
⠚ 13/47 ETA 5s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_shift.html,Template | Failed: Network
⠚ 14/47 ETA 4s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dy.html,Template | Failed: Network e
⠚ 15/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_basetype_records.html,Template | Failed:
⠚ 16/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Networ
⠒ 16/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Networ
⠒ 17/47 ETA 1s ███████░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtdurd.html,Template | Failed: Netwo
⠒ 18/47 ETA 1s ███████░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_joined.html,Template | Failed: Netwo
⠒ 19/47 ETA 1s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_ontrtfl.html,Template | Failed: Netwo
⠒ 20/47 ETA 0s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_chg.html,Template | Failed: Network e
⠒ 21/47 ETA 0s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_extreme_records.html,Template | Failed: N
⠒ 22/47 ETA 0s █████████░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_merged_exist_flag.html,Template | Fai
⠒ 23/47 ETA 0s █████████░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dt.html,Template | Failed: Network e
⠒ 24/47 ETA 0s ██████████░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_tm.html,Template | Failed: Ne
⠒ 25/47 ETA 0s ██████████░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bsa.html,Template | Failed: Network
⠒ 26/47 ETA 0s ███████████░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_duration.html,Template | Failed: Net
⠒ 27/47 ETA 0s ███████████░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm.html,Template | Failed: Network
⠒ 32/47 ETA 0s █████████████░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged.html,Template | Failed: Netwo
⠒ 33/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_flag.html,Template | Failed:
⠒ 34/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_summary_records.html,Template | Failed: N
⠂ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
⠂ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
⠒ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
  47/47 ETA 0s ████████████████████ Finished extracting links
Issues found in 1 input. Find details below.

[data-raw/admiral-lookup-book.csv]:
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_query.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_analysis_ratio.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_rr.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtemfl.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bsa.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr_dir.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_basetype_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_ontrtfl.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_chg.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_wbc_abs.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_summary_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_map.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_obs_number.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_pchg.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged_lookup.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_joined.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dt.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_shift.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/restrict_derivation.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_tm.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_dt.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_qtc.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_flag.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtdurd.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_extreme_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_duration.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dy.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_merged_exist_flag.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_dt.html,Template | Failed: Network error: Not Found

🔍 47 Total ✅ 12 OK 🚫 35 Errors (HTTP:35)

When I modify the file by adding quotes around the URLs in the CSV, I get the expected result:

❯ lychee **/*.csv
  47/47 ETA 0s ████████████████████ Finished extracting links           
  🔍 47 Total ✅ 47 OK 🚫 0 Errors

Although commas are allowed/safe characters in URLs, would it be possible for linkify to detect CSV files and extract URLs from them without having to wrap the URL strings in quotes?

mre commented 7 months ago

@robinst, what are your thoughts on this? Is this out of scope for linkify?

robinst commented 7 months ago

Hmm. I wouldn't know how to distinguish this from the plain-text case; linkify doesn't even know the file extension.

Does lychee have support for detecting file types via extension? That would help in this case.

mre commented 7 months ago

Yes, it does. Maybe it could be passed as a parameter to linkify, although I could see why one would not want to do that. Tricky one. Not sure where to draw the line between the tools.

robinst commented 6 months ago

I think in this case it would be nice if lychee could detect CSV files, use a parser library to parse them, and then feed the individual cell values to linkify.
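A minimal sketch of that idea, using only the standard library so it runs as-is. The names `is_csv`, `split_cells`, and `links_in_csv` are hypothetical and are not part of lychee or linkify. The hand-written splitter stands in for a real CSV parser library, and the `find_links` closure stands in for a call into linkify (for example, collecting the results of `LinkFinder::new().links(cell)`). The sample row is made up for illustration, modeled on the URLs in the report above.

```rust
use std::path::Path;

/// Returns true when the path has a `.csv` extension (case-insensitive).
fn is_csv(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .map_or(false, |ext| ext.eq_ignore_ascii_case("csv"))
}

/// Splits one CSV line into cells. Handles double-quoted fields and `""`
/// escapes, but not quoted fields that span several lines; a real parser
/// library would cover those cases.
fn split_cells(line: &str) -> Vec<String> {
    let mut cells = Vec::new();
    let mut cell = String::new();
    let mut in_quotes = false;
    let mut chars = line.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // An escaped quote (`""`) inside a quoted field.
            '"' if in_quotes && chars.peek() == Some(&'"') => {
                cell.push('"');
                chars.next();
            }
            // Opening or closing quote of a field.
            '"' => in_quotes = !in_quotes,
            // A comma outside quotes ends the current cell.
            ',' if !in_quotes => cells.push(std::mem::take(&mut cell)),
            _ => cell.push(c),
        }
    }
    cells.push(cell);
    cells
}

/// Runs `find_links` on each cell separately, so a link can never swallow
/// the comma and the text of the neighbouring cell.
fn links_in_csv<F>(text: &str, find_links: F) -> Vec<String>
where
    F: Fn(&str) -> Vec<String>,
{
    text.lines()
        .flat_map(|line| split_cells(line))
        .flat_map(|cell| find_links(&cell))
        .collect()
}

fn main() {
    // Stand-in for the real link finder: treats a cell as a link if it
    // starts with an HTTP(S) scheme. In lychee this would call linkify.
    let find_links = |cell: &str| -> Vec<String> {
        if cell.starts_with("http://") || cell.starts_with("https://") {
            vec![cell.to_string()]
        } else {
            Vec::new()
        }
    };

    // Hypothetical sample data; the column names are assumptions.
    let data = "name,url,type\n\
                derive_var_base,https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template";

    if is_csv(Path::new("data-raw/admiral-lookup-book.csv")) {
        for link in links_in_csv(data, find_links) {
            println!("{}", link);
        }
    }
}
```

Because each cell is handed over on its own, the URL above comes out without the trailing `,Template`, and no quoting in the CSV is needed.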

mre commented 6 months ago

That's a great idea and I think that's a solid way forward. Thanks for the insight! I'll update the original issue accordingly.