lycheeverse / lychee

⚡ Fast, async, stream-based link checker written in Rust. Finds broken URLs and mail addresses inside Markdown, HTML, reStructuredText, websites and more!
https://lychee.cli.rs
Apache License 2.0

URL discovery in CSV files where values are not wrapped in quotes #1299

Open cicdguy opened 11 months ago

cicdguy commented 11 months ago

Hello,

I'm using lychee 0.13.0 and running it against this file: https://github.com/pharmaverse/admiraldiscovery/blob/06e6e55b884ef91de9ae457606ed66defc9dba14/data-raw/admiral-lookup-book.csv

Like so:

lychee **/*.csv

And I get the following result:

⠚ 1/47 ETA 80s ░░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_analysis_ratio.html,Template | Failed
⠚ 2/47 ETA 39s ░░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_dt.html,Template | Failed: Ne
⠚ 3/47 ETA 25s █░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_qtc.html,Template | Failed: Network
⠚ 4/47 ETA 19s █░░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr.html,Template | Failed: Networ
⠚ 5/47 ETA 15s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template | Failed: Network
⠚ 6/47 ETA 12s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtemfl.html,Template | Failed: Netwo
⠚ 7/47 ETA 10s ██░░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged_lookup.html,Template | Failed
⠚ 8/47 ETA 9s ███░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr_dir.html,Template | Failed: Net
⠚ 9/47 ETA 8s ███░░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_map.html,Template | Failed: Network
⠚ 10/47 ETA 7s ████░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_rr.html,Template | Failed: Network
⠚ 11/47 ETA 6s ████░░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_dt.html,Template | Failed: Ne
⠚ 12/47 ETA 6s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/restrict_derivation.html,Template | Failed: Netw
⠚ 13/47 ETA 5s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_shift.html,Template | Failed: Network
⠚ 14/47 ETA 4s █████░░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dy.html,Template | Failed: Network e
⠚ 15/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_basetype_records.html,Template | Failed:
⠚ 16/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Networ
⠒ 16/47 ETA 4s ██████░░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Networ
⠒ 17/47 ETA 1s ███████░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtdurd.html,Template | Failed: Netwo
⠒ 18/47 ETA 1s ███████░░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_joined.html,Template | Failed: Netwo
⠒ 19/47 ETA 1s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_ontrtfl.html,Template | Failed: Netwo
⠒ 20/47 ETA 0s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_chg.html,Template | Failed: Network e
⠒ 21/47 ETA 0s ████████░░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_extreme_records.html,Template | Failed: N
⠒ 22/47 ETA 0s █████████░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_merged_exist_flag.html,Template | Fai
⠒ 23/47 ETA 0s █████████░░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dt.html,Template | Failed: Network e
⠒ 24/47 ETA 0s ██████████░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_tm.html,Template | Failed: Ne
⠒ 25/47 ETA 0s ██████████░░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bsa.html,Template | Failed: Network
⠒ 26/47 ETA 0s ███████████░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_duration.html,Template | Failed: Net
⠒ 27/47 ETA 0s ███████████░░░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm.html,Template | Failed: Network
⠒ 32/47 ETA 0s █████████████░░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged.html,Template | Failed: Netwo
⠒ 33/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_flag.html,Template | Failed:
⠒ 34/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_summary_records.html,Template | Failed: N
⠂ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
⠂ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
⠒ 35/47 ETA 0s ██████████████░░░░░░ ✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network
  47/47 ETA 0s ████████████████████ Finished extracting links
Issues found in 1 input. Find details below.

[data-raw/admiral-lookup-book.csv]:
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_query.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_analysis_ratio.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_rr.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtemfl.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bsa.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr_dir.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_basetype_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_ontrtfl.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_chg.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_anrind.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_wbc_abs.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_summary_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_map.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_obs_number.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_pchg.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged_lookup.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_joined.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dt.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_shift.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/restrict_derivation.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_bmi.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_tm.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dtm_to_dt.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_param_qtc.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_atoxgr.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_merged.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_flag.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_trtdurd.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_extreme_records.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_duration.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_vars_dy.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_merged_exist_flag.html,Template | Failed: Network error: Not Found
✗ [404] https://pharmaverse.github.io/admiral/reference/derive_var_extreme_dt.html,Template | Failed: Network error: Not Found

🔍 47 Total ✅ 12 OK 🚫 35 Errors (HTTP:35)

When I modify the file by wrapping the URLs in the CSV in quotes, I get the expected result:

❯ lychee **/*.csv
  47/47 ETA 0s ████████████████████ Finished extracting links           
  🔍 47 Total ✅ 47 OK 🚫 0 Errors

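I understand that commas are allowed/safe characters in URLs, which is presumably why the trailing ",Template" ends up being treated as part of each link. Still, would it be possible for lychee to detect CSV files and extract URLs from them without requiring the URL strings to be wrapped in quotes?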
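
To make the failure mode concrete, here is a minimal sketch in Python. The regular expression is a hypothetical stand-in for a plaintext URL scanner; it is not linkify's actual algorithm, and the sample row is an assumption modeled on the error output above. Because a comma is a legal URL character, a scanner looking at raw text has no reason to stop at it, while a quote character ends the match.

```python
import re

# Hypothetical stand-in for a plaintext URL scanner (NOT linkify's real logic).
# It accepts any character that is not whitespace, a quote, or an angle
# bracket, so commas are treated as part of the URL.
URL_PATTERN = re.compile(r"https?://[^\s\"<>]+")

# Assumed sample rows, shaped like the entries in the failing output.
unquoted = "https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template"
quoted = '"https://pharmaverse.github.io/admiral/reference/derive_var_base.html",Template'

# Without quotes, the match runs through the comma into the next field.
unquoted_match = URL_PATTERN.search(unquoted).group(0)

# With quotes, the closing quote terminates the match at the real URL end.
quoted_match = URL_PATTERN.search(quoted).group(0)

print(unquoted_match)
print(quoted_match)
```

In this sketch the unquoted row yields a "URL" ending in ",Template", which matches the shape of the 404 entries reported above, while the quoted row yields the clean link.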

mre commented 10 months ago

Thanks for creating the issue. I think we should discuss that with the folks at linkify, which is the plaintext parser we use. I don't know if it will be an easy fix for them, though. 😕 Could you still open an issue over there and ask for feedback?

cicdguy commented 10 months ago

@mre - thank you for your response. Yes, definitely, I can open an issue there and request feedback.

mre commented 7 months ago

@robinst suggested using a CSV parser and passing individual cells to linkify. This has a few advantages:

I think this is the way forward. @cicdguy, perhaps you want to close the linkify issue again, and we can focus on fixing this issue in lychee itself as per the above plan? What do you think? 😃
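
As a rough illustration of that plan (not lychee's implementation), the idea can be sketched in Python: parse the file as CSV first, then run link extraction on each cell separately. Python's standard `csv` module stands in for whatever CSV parser lychee would use, and the regex and the `extract_urls_from_csv` helper are hypothetical stand-ins for linkify. The sample data and its column layout are assumptions.

```python
import csv
import io
import re

# Hypothetical stand-in for linkify's plaintext link detection.
URL_PATTERN = re.compile(r"https?://[^\s\"<>]+")


def extract_urls_from_csv(text):
    """Parse text as CSV and scan each cell for URLs on its own."""
    urls = []
    # newline="" lets the csv module handle line endings itself.
    for row in csv.reader(io.StringIO(text, newline="")):
        for cell in row:
            # The delimiter is already consumed by the CSV parser, so a
            # neighbouring field can no longer leak into the match.
            urls.extend(URL_PATTERN.findall(cell))
    return urls


# Assumed sample: one unquoted URL cell and one quoted URL cell.
sample = (
    "function,link,type\n"
    "derive_var_base,https://pharmaverse.github.io/admiral/reference/derive_var_base.html,Template\n"
    'derive_var_chg,"https://pharmaverse.github.io/admiral/reference/derive_var_chg.html",Template\n'
)

urls = extract_urls_from_csv(sample)
for url in urls:
    print(url)
```

One trade-off to keep in mind: with this approach an unquoted comma is always a field delimiter, so a URL that genuinely contains a comma would have to be quoted in the CSV to survive intact.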

cicdguy commented 7 months ago

Sounds great. Thank you @mre!