kellyjonbrazil / jc

CLI tool and python library that converts the output of popular command-line tools, file-types, and common strings to JSON, YAML, or Dictionaries. This allows piping of output to tools like jq and simplifying automation scripts.
MIT License

Split Function: line numbers different #522

Closed muescha closed 9 months ago

muescha commented 10 months ago

Where

I can pass a line slice to jc as a parameter, like "3:10".

bat test.out
───────┬───────────────────────────────
       │ File: test.out
       │ Size: 83 B
───────┬───────────────────────────────
   1   │ # Abc
   2   │
   3   │ # DEF
   4   │
   5   │ a:b
   6   │ c:d
   7   │ e:f
   8   │
   9   │ g:h
  10   │ i:k
  11   │
  12   │ More comments
  13   │ more ocmments
  14   │ Summary - abce: 123
───────┬───────────────────────────────

I expected "5:10" to cut out lines 5 through 10, which jc would then process:

cat test.out | jc --kv -p "5:10"
{
  "g": "h",
  "i": "k",
  "More comments": "",
  "more ocmments": "",
  "Summary - abce": "123"
}

ok - maybe it is zero based... then

cat test.out | jc --kv -p "4:9"
{
  "e": "f",
  "g": "h",
  "i": "k",
  "More comments": "",
  "more ocmments": ""
}

But it looks like the slice also does not count the empty lines:

cat test.out | jc --kv -p "2:7"
{
  "a": "b",
  "c": "d",
  "e": "f",
  "g": "h",
  "i": "k"
}

kellyjonbrazil commented 10 months ago

It looks like the _lazy_splitlines function (https://github.com/kellyjonbrazil/jc/blob/dev/jc/utils.py#L397) is skipping blank lines. I'll need to find a way around that.
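
The index shift reported above can be reproduced with plain Python lists. This is only a simplified model, not jc's actual code: it assumes the old behaviour is equivalent to dropping empty lines before slicing, and `SAMPLE` is a hand-typed copy of test.out.

```python
# Simplified model of the report above (not jc's implementation).
# SAMPLE is a hand-typed copy of test.out (no trailing newline).
SAMPLE = (
    "# Abc\n\n# DEF\n\na:b\nc:d\ne:f\n\ng:h\ni:k\n\n"
    "More comments\nmore ocmments\nSummary - abce: 123"
)

all_lines = SAMPLE.splitlines()                    # blank lines kept
non_blank = [line for line in all_lines if line]   # blank lines dropped

# With blank lines dropped first, "2:7" selects the key/value pairs ...
print(non_blank[2:7])

# ... whereas with blank lines counted, the same region is "4:10".
print(all_lines[4:10])
```

In this model, slicing the blank-free list with "5:10" and "4:9" also yields the same keys as the two earlier outputs in the report.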

kellyjonbrazil commented 10 months ago

I added a small fix to _lazy_splitlines and it seems to work better. I'll put this in dev so you can test or you can modify the two lines manually:

import re
from typing import Iterable


def _lazy_splitlines(text: str) -> Iterable[str]:
    # Lazily yield lines, splitting on \r\n, \r, or \n.
    NEWLINES_PATTERN: str = r'(\r\n|\r|\n)'
    NEWLINES_RE = re.compile(NEWLINES_PATTERN)
    start = 0
    for m in NEWLINES_RE.finditer(text):
        begin, end = m.span()
        if begin != start:
            yield text[start:begin]
        else:                                  # add this line
            yield ''                           # add this line
        start = end

    # Yield any trailing text after the last newline.
    if text[start:]:
        yield text[start:]
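
A quick way to check the patched logic is to run it on a sample and compare it with `str.splitlines()`. The sketch below is a standalone copy for experimentation: the name `lazy_splitlines_fixed` is hypothetical, the if/else is collapsed into a single `yield` (the slice is already `''` when two newlines are adjacent), and using `itertools.islice` to take the window is an assumption about how a lazy slice could be applied, not a statement about jc's internals.

```python
import re
from itertools import islice
from typing import Iterator


def lazy_splitlines_fixed(text: str) -> Iterator[str]:
    """Hypothetical standalone copy of the patched logic, for experimentation."""
    start = 0
    for m in re.finditer(r'(\r\n|\r|\n)', text):
        begin, end = m.span()
        yield text[start:begin]   # '' when two newlines are adjacent
        start = end
    if text[start:]:              # trailing text without a final newline
        yield text[start:]


sample = "# Abc\n\n# DEF\n\na:b\nc:d\ne:f\n\ng:h\ni:k\n"

lines = list(lazy_splitlines_fixed(sample))
# For this sample, blank lines are preserved exactly as str.splitlines() does.
assert lines == sample.splitlines()

# Lazily take zero-based lines 4 through 9 (end index 10 is excluded).
window = list(islice(lazy_splitlines_fixed(sample), 4, 10))
print(window)
```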

kellyjonbrazil commented 10 months ago

https://github.com/kellyjonbrazil/jc/commit/da28ff7a0b3160f5260bb3a314b98c44dc04d735

muescha commented 10 months ago

It works as expected.

I am new to Python Array Slicing™ and find it confusing: it is zero-based and excludes the end index.

I expected to write "5:10" or "4:9", but I need to write "4:10":

bat test.out | jc "4:10" --kv -p
{
  "a": "b",
  "c": "d",
  "e": "f",
  "g": "h",
  "i": "k"
}

But that is OK if it is normal in Python. Maybe this behaviour should be documented in the docs and in the command-line help for users who are not familiar with Python slicing?
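
For readers who think in editor line numbers, the conversion is mechanical: subtract one from the first line and leave the last line unchanged. A minimal sketch, where `to_python_slice` is a hypothetical helper and not part of jc:

```python
def to_python_slice(first_line: int, last_line: int) -> str:
    """Convert a 1-based, inclusive line range to a zero-based, end-exclusive slice.

    Hypothetical helper for illustration only; jc does not provide it.
    """
    if first_line < 1 or last_line < first_line:
        raise ValueError("expected 1 <= first_line <= last_line")
    # The start shifts down by one (zero-based); the end stays the same
    # because the shift and the exclusive upper bound cancel out.
    return f"{first_line - 1}:{last_line}"


print(to_python_slice(5, 10))  # lines 5 through 10 of test.out -> 4:10
```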

muescha commented 10 months ago

Is this right in this case? Here I would expect 100:

    Line Slicing:
-        $ cat file.csv | jc :101 --csv    # parse first 100 lines
+        $ cat file.csv | jc :100 --csv    # parse first 100 lines

muescha commented 10 months ago

This help text is then also confusing:

         data:              (string or iterable) - input to slice by lines
-        slice_start:       (int) - starting line
+        slice_start:       (int) - starting line (zero based)
-        slice_end:         (int) - ending line
+        slice_end:         (int) - ending line (+1)

muescha commented 10 months ago

Or would it be better to use 1-based indexing with an inclusive ending line (as I originally expected), as in "5:10"? Starting line: 5, through ending line: 10.

kellyjonbrazil commented 10 months ago

Here are some explanations as to why Python slicing works the way it does. There is an elegance factor that maybe only a programmer would see.

https://stackoverflow.com/questions/11364533/why-are-slice-and-range-upper-bound-exclusive
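
A few of the properties usually cited for half-open slices can be checked directly in Python. This is a small illustration on a made-up list, not anything specific to jc:

```python
lines = ["l0", "l1", "l2", "l3", "l4", "l5"]
start, end, k = 1, 4, 3

# 1. The length of a slice is simply end - start.
assert len(lines[start:end]) == end - start

# 2. Splitting at any index k gives two slices that rejoin with no overlap or gap.
assert lines[:k] + lines[k:] == lines

# 3. lines[:n] is exactly the first n items.
assert lines[:k] == ["l0", "l1", "l2"]

middle = lines[start:end]
print(middle)
```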

kellyjonbrazil commented 10 months ago

The slicing behavior is documented in the readme and man page but we can probably add more.

In the csv example in help I use :101 to account for the zero start and the header row.
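
The arithmetic behind `:101` can be sketched in plain Python: the first 101 lines of a CSV file are one header line plus 100 data rows. The CSV content below is made up for illustration, and `csv.DictReader` stands in for a CSV parser; this is not jc's code.

```python
import csv
import io
from itertools import islice

# Hypothetical CSV input: 1 header line followed by 250 data rows.
csv_text = "id,value\n" + "".join(f"{i},{i * i}\n" for i in range(250))

# Equivalent of the slice ":101": zero-based lines 0 through 100.
first_101_lines = list(islice(io.StringIO(csv_text), 101))

# The header line is consumed as field names, leaving 100 parsed rows.
rows = list(csv.DictReader(first_101_lines))
print(len(rows))
```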

muescha commented 10 months ago

> In the csv example in help I use :101 to account for the zero start and the header row.

The current example is a bit confusing. I recommend using a clearer illustration. Understanding the slicing index here requires knowing that the header row is not counted.

Consider this alternative:

$ cat output.txt | jc 4:15 --parser    # Parse from line 4 to 14 with parser (zero-based)

Additionally, it might be helpful to include an explanation of the SLICE option in the jc --help command:

Slice:
  [start]:[end] 
    start: [[-]index] - Zero-based start line, negative index for counting from the end
    end: [[-]index] - Zero-based end line (excluding the index), negative index for counting from the end

Maybe this provides a clearer and more detailed explanation.
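
To make the proposed help text concrete, here is how the same start and end rules behave on a Python list. It assumes the SLICE option follows Python list-slicing semantics, as described earlier in this thread; the list contents are made up.

```python
lines = [f"line {n}" for n in range(10)]   # "line 0" ... "line 9"

from_4_to_7 = lines[4:8]       # zero-based start 4, end 8 is excluded
last_three = lines[-3:]        # negative start counts from the end
without_last_two = lines[:-2]  # negative end also counts from the end
trimmed = lines[2:-2]          # drop two lines from each side

print(from_4_to_7)
print(last_three)
```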

kellyjonbrazil commented 10 months ago

Agreed - definitely room for improvement. I can add these doc updates.

kellyjonbrazil commented 9 months ago

Added in v1.25.0