lycheeverse / lychee

⚡ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!
https://lychee.cli.rs
Apache License 2.0

URL discovery in CSV files where values are not wrapped in quotes #1299

Open cicdguy opened 1 year ago

cicdguy commented 1 year ago

Hello,

I'm using lychee 0.13.0 and running it against this file: https://github.com/pharmaverse/admiraldiscovery/blob/06e6e55b884ef91de9ae457606ed66defc9dba14/data-raw/admiral-lookup-book.csv

Like so:

lychee **/*.csv

And I get the following result:

⠚ 1/47 ETA 80s ░░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_analysis_ratio.html,Template | Failed
⠚ 2/47 ETA 39s ░░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_dt.html,Template | Failed: Ne
⠚ 3/47 ETA 25s █░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_qtc.html,Template | Failed: Network
⠚ 4/47 ETA 19s █░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr.html,Template | Failed: Networ
⠚ 5/47 ETA 15s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template | Failed: Network
⠚ 6/47 ETA 12s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtemfl.html,Template | Failed: Netwo
⠚ 7/47 ETA 10s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged_lookup.html,Template | Failed
⠚ 8/47 ETA 9s ███░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr_dir.html,Template | Failed: Net
⠚ 9/47 ETA 8s ███░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_map.html,Template | Failed: Network
⠚ 10/47 ETA 7s ████░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_rr.html,Template | Failed: Network
⠚ 11/47 ETA 6s ████░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_dt.html,Template | Failed: Ne
⠚ 12/47 ETA 6s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/restrict_derivation.html,Template | Failed: Netw
⠚ 13/47 ETA 5s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_shift.html,Template | Failed: Network
⠚ 14/47 ETA 4s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dy.html,Template | Failed: Network e
⠚ 15/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_basetype_records.html,Template | Failed:
⠚ 16/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Networ
⠒ 16/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Networ
⠒ 17/47 ETA 1s ███████░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtdurd.html,Template | Failed: Netwo
⠒ 18/47 ETA 1s ███████░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_joined.html,Template | Failed: Netwo
⠒ 19/47 ETA 1s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_ontrtfl.html,Template | Failed: Netwo
⠒ 20/47 ETA 0s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_chg.html,Template | Failed: Network e
⠒ 21/47 ETA 0s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_extreme_records.html,Template | Failed: N
⠒ 22/47 ETA 0s █████████░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_merged_exist_flag.html,Template | Fai
⠒ 23/47 ETA 0s █████████░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dt.html,Template | Failed: Network e
⠒ 24/47 ETA 0s ██████████░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_tm.html,Template | Failed: Ne
⠒ 25/47 ETA 0s ██████████░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bsa.html,Template | Failed: Network
⠒ 26/47 ETA 0s ███████████░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_duration.html,Template | Failed: Net
⠒ 27/47 ETA 0s ███████████░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm.html,Template | Failed: Network
⠒ 32/47 ETA 0s █████████████░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged.html,Template | Failed: Netwo
⠒ 33/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_flag.html,Template | Failed:
⠒ 34/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_summary_records.html,Template | Failed: N
⠂ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
⠂ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
⠒ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
  47/47 ETA 0s ████████████████████ Finished extracting links
Issues found in 1 input. Find details below.

[data-raw/admiral-lookup-book.csv]:
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_query.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_analysis_ratio.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_rr.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtemfl.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bsa.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr_dir.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_basetype_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_ontrtfl.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_chg.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_wbc_abs.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_summary_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_map.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_obs_number.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_pchg.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged_lookup.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_joined.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dt.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_shift.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/restrict_derivation.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_tm.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_dt.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_qtc.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_flag.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtdurd.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_extreme_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_duration.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dy.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_merged_exist_flag.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_dt.html,Template | Failed: Network error: Not Found

🔍 47 Total ✅ 12 OK 🚫 35 Errors (HTTP:35)

When I modify the file by wrapping the URLs in the CSV in quotes, I get the expected result:

❯ lychee **/*.csv
  47/47 ETA 0s ████████████████████ Finished extracting links           
  🔍 47 Total ✅ 47 OK 🚫 0 Errors
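
As a stopgap, the quoting workaround described above could be scripted rather than done by hand. The following is a minimal Python sketch that re-emits a CSV with every field wrapped in double quotes. The function name `quote_all_fields`, the column names, and the `example.com` URL are hypothetical and only serve as illustration; they are not taken from the file in this issue.

```python
import csv
import io


def quote_all_fields(csv_text: str) -> str:
    """Re-emit CSV text with every field wrapped in double quotes."""
    rows = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    # QUOTE_ALL quotes every field, including ones that do not need it.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical sample data, not the real admiral-lookup-book.csv.
sample = (
    "function,url,type\n"
    "derive_var_base,https://example.com/derive_var_base.html,Template\n"
)
quoted = quote_all_fields(sample)
print(quoted)
```

This changes the file on disk, so it is only a workaround; it does not address the extraction behavior itself.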

Although commas are allowed/safe characters in URLs, would it be possible for lychee to detect CSV files and extract URLs from them without having to wrap the URL strings in quotes?
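
To illustrate the likely cause, here is a small Python sketch. It uses a simplified regular expression as a stand-in for a plaintext URL scanner; it is not linkify's actual algorithm, and the sample row and `example.com` URL are made up. Because a comma is a legal URL character, the scanner keeps reading past the cell boundary, while a double quote is not a URL character and ends the match.

```python
import re

# Simplified stand-in for a plaintext URL scanner (an assumption, not
# linkify's real rules): a URL runs until whitespace, a quote, or < >.
URL_CHARS = re.compile(r"https?://[^\s\"<>]+")

# Hypothetical CSV rows, unquoted and quoted.
line = "derive_var_base,https://example.com/reference/derive_var_base.html,Template"
quoted_line = 'derive_var_base,"https://example.com/reference/derive_var_base.html",Template'

# The comma is accepted as part of the URL, so ",Template" is swallowed.
unquoted_match = URL_CHARS.findall(line)

# The closing quote stops the match at the real end of the URL.
quoted_match = URL_CHARS.findall(quoted_line)

print(unquoted_match)
print(quoted_match)
```

The unquoted row yields a URL ending in `.html,Template`, which matches the shape of the 404 entries in the output above.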

mre commented 1 year ago

Thanks for creating the issue. I think we should discuss that with the folks at linkify, which is the plaintext parser we use. I don't know if it will be an easy fix for them, though. 😕 Could you still open an issue over there and ask for feedback?

cicdguy commented 1 year ago

@mre - thank you for your response. Yes, definitely, I can open an issue there and request feedback.

mre commented 9 months ago

@robinst suggested using a CSV parser and passing individual cells to linkify. This has a few advantages:

I think this is the way forward. @cicdguy, perhaps you want to close the linkify issue again, and we can focus on fixing this issue in lychee itself as per the above plan? What do you think? 😃
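
The plan above (parse the CSV first, then scan each cell on its own) can be sketched roughly as follows. This is an illustrative Python sketch, not lychee's implementation: lychee is written in Rust, and here Python's standard `csv` module plays the role of the CSV parser while a simplified regular expression stands in for linkify. The function name `extract_urls_from_csv`, the sample rows, and the `example.com` URLs are hypothetical.

```python
import csv
import io
import re

# Simplified stand-in for linkify (an assumption for this sketch).
URL_PATTERN = re.compile(r"https?://[^\s\"<>]+")


def extract_urls_from_csv(csv_text: str) -> list[str]:
    """Parse CSV text and scan each cell separately for URLs."""
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        for cell in row:
            # Each cell is scanned in isolation, so a match cannot
            # run across the delimiter into the neighboring cell.
            urls.extend(URL_PATTERN.findall(cell))
    return urls


# Hypothetical sample with one unquoted and one quoted URL cell.
sample = (
    "function,url,type\n"
    "derive_var_base,https://example.com/reference/derive_var_base.html,Template\n"
    'derive_var_chg,"https://example.com/reference/derive_var_chg.html",Template\n'
)
urls = extract_urls_from_csv(sample)
print(urls)
```

With this approach quoted and unquoted cells give the same result, because the CSV parser decides where a cell ends before any URL detection runs. A URL that itself contains a comma would still need to be quoted to be valid CSV, which the parser handles.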

cicdguy commented 9 months ago

Sounds great. Thank you @mre!