tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org
Other
622 stars 60 forks source link

Pre-filtering through pipe connections does not work on remote files #508

Closed DiabbZegpi closed 11 months ago

DiabbZegpi commented 1 year ago

Trying to use the pre-filtering capacity of awk or grep through pipe() connections works in a flawless fashion on local text files, BUT this functionality does not seem to work on remote files.

For reproducibility, this is an example with the UK's cats dataset from the Tidy Tuesday Project. The code below attempts to filter for those rows in which the first field has the string "Ares", but returns an empty tibble.

library(vroom)
cats_file <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv"

# Example with `awk`
vroom(
  file = pipe(paste("awk -F ',' '$1 ~ /Ares/'", cats_file)),
  delim = ","
)

# Same filter but with `grep`
vroom(
  file = pipe(paste("grep '^Ares,'", cats_file)),
  delim = ","
)

Downloading the csv and performing the same operation, but with the local path, works smoothly. Am I making a mistake? What is the correct way of performing pre-filtering of remote text files with vroom?

jennybc commented 11 months ago

Yes, you can do this by using curl to stream the data then (unix) pipe that into grep, awk, or (see below) perl and put all of that into (R) pipe().

Here I'm using the techniques shown by @jimhester in this YouTube video: https://youtu.be/RYhwZW6ofbI?si=4uRdHcgsSBckd8n4&t=825

Note the last example, which gives a nicer result than the first two, because it also gets the column names.

library(vroom)
library(glue)

cats_file <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-01-31/cats_uk.csv"

cmd <- glue("curl -sL {cats_file} | grep '^Ares,'")
vroom(pipe(cmd), col_names = FALSE)
#> Rows: 103 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (2): X1, X11
#> dbl  (5): X2, X5, X6, X7, X8
#> lgl  (3): X3, X9, X10
#> dttm (1): X4
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 103 × 11
#>    X1           X2 X3    X4                     X5    X6    X7    X8 X9    X10  
#>    <chr>     <dbl> <lgl> <dttm>              <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
#>  1 Ares     3.40e9 TRUE  2017-06-24 01:03:57 -5.11  50.2   684 155.  FALSE FALSE
#>  2 Ares     3.40e9 TRUE  2017-06-24 01:11:20 -5.11  50.2   936 155.  FALSE FALSE
#>  3 Ares     3.40e9 TRUE  2017-06-24 02:58:16 -5.11  50.2  2340  81.4 FALSE FALSE
#>  4 Ares     3.40e9 TRUE  2017-06-24 03:01:26 -5.11  50.2     0  67.8 FALSE FALSE
#>  5 Ares     3.40e9 TRUE  2017-06-24 03:51:58 -5.11  50.2  4896 118.  FALSE FALSE
#>  6 Ares     3.40e9 TRUE  2017-06-24 03:55:03 -5.11  50.2   504 123.  FALSE FALSE
#>  7 Ares     3.40e9 TRUE  2017-06-24 05:24:17 -5.11  50.2   108  27.1 FALSE FALSE
#>  8 Ares     3.40e9 TRUE  2017-06-24 05:27:23 -5.11  50.2   504  36.3 FALSE FALSE
#>  9 Ares     3.40e9 TRUE  2017-06-24 09:14:37 -5.11  50.2   252  67.4 FALSE FALSE
#> 10 Ares     3.40e9 TRUE  2017-06-24 09:17:42 -5.11  50.2  4068  78.9 FALSE FALSE
#> # ℹ 93 more rows
#> # ℹ 1 more variable: X11 <chr>

cmd <- glue("curl -sL {cats_file} | awk -F ',' '$1 ~ /Ares/'")
vroom(pipe(cmd), col_names = FALSE)
#> Rows: 103 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (2): X1, X11
#> dbl  (5): X2, X5, X6, X7, X8
#> lgl  (3): X3, X9, X10
#> dttm (1): X4
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 103 × 11
#>    X1           X2 X3    X4                     X5    X6    X7    X8 X9    X10  
#>    <chr>     <dbl> <lgl> <dttm>              <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
#>  1 Ares     3.40e9 TRUE  2017-06-24 01:03:57 -5.11  50.2   684 155.  FALSE FALSE
#>  2 Ares     3.40e9 TRUE  2017-06-24 01:11:20 -5.11  50.2   936 155.  FALSE FALSE
#>  3 Ares     3.40e9 TRUE  2017-06-24 02:58:16 -5.11  50.2  2340  81.4 FALSE FALSE
#>  4 Ares     3.40e9 TRUE  2017-06-24 03:01:26 -5.11  50.2     0  67.8 FALSE FALSE
#>  5 Ares     3.40e9 TRUE  2017-06-24 03:51:58 -5.11  50.2  4896 118.  FALSE FALSE
#>  6 Ares     3.40e9 TRUE  2017-06-24 03:55:03 -5.11  50.2   504 123.  FALSE FALSE
#>  7 Ares     3.40e9 TRUE  2017-06-24 05:24:17 -5.11  50.2   108  27.1 FALSE FALSE
#>  8 Ares     3.40e9 TRUE  2017-06-24 05:27:23 -5.11  50.2   504  36.3 FALSE FALSE
#>  9 Ares     3.40e9 TRUE  2017-06-24 09:14:37 -5.11  50.2   252  67.4 FALSE FALSE
#> 10 Ares     3.40e9 TRUE  2017-06-24 09:17:42 -5.11  50.2  4068  78.9 FALSE FALSE
#> # ℹ 93 more rows
#> # ℹ 1 more variable: X11 <chr>

cmd <- glue("curl -sL {cats_file} | perl -ne 'use English; print if $INPUT_LINE_NUMBER == 1 || /^Ares,/'")
vroom(pipe(cmd))
#> Rows: 103 Columns: 11
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (2): tag_id, study_name
#> dbl  (5): event_id, location_long, location_lat, ground_speed, height_above_...
#> lgl  (3): visible, algorithm_marked_outlier, manually_marked_outlier
#> dttm (1): timestamp
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 103 × 11
#>    tag_id   event_id visible timestamp           location_long location_lat
#>    <chr>       <dbl> <lgl>   <dttm>                      <dbl>        <dbl>
#>  1 Ares   3395610551 TRUE    2017-06-24 01:03:57         -5.11         50.2
#>  2 Ares   3395610552 TRUE    2017-06-24 01:11:20         -5.11         50.2
#>  3 Ares   3395610553 TRUE    2017-06-24 02:58:16         -5.11         50.2
#>  4 Ares   3395610554 TRUE    2017-06-24 03:01:26         -5.11         50.2
#>  5 Ares   3395610555 TRUE    2017-06-24 03:51:58         -5.11         50.2
#>  6 Ares   3395610556 TRUE    2017-06-24 03:55:03         -5.11         50.2
#>  7 Ares   3395610557 TRUE    2017-06-24 05:24:17         -5.11         50.2
#>  8 Ares   3395610558 TRUE    2017-06-24 05:27:23         -5.11         50.2
#>  9 Ares   3395610559 TRUE    2017-06-24 09:14:37         -5.11         50.2
#> 10 Ares   3395610560 TRUE    2017-06-24 09:17:42         -5.11         50.2
#> # ℹ 93 more rows
#> # ℹ 5 more variables: ground_speed <dbl>, height_above_ellipsoid <dbl>,
#> #   algorithm_marked_outlier <lgl>, manually_marked_outlier <lgl>,
#> #   study_name <chr>

Created on 2023-09-28 with reprex v2.0.2.9000