mattgodbolt / zindex

Create an index on a compressed text file
BSD 2-Clause "Simplified" License

Support tab characters for -d #26

Open mattgodbolt opened 7 years ago

mattgodbolt commented 7 years ago

Seems no matter how you try, you can't pass a tab character to -d

mattgodbolt commented 7 years ago

Tested with:

~/dev/zindex/build/cmake-build-debug/zindex --delimiter \t --field 1 data.gz --debug --verbose

jim892 commented 7 years ago

I downloaded zindex on 2/15/2017.
Seems no matter how I try, I can't pass a tab character to -d or to --delimiter. The example in the comment from Nov 24, 2016 (with --delimiter \t) does not seem to work.

$ zindex -f 2 --delimiter \t --debug tab10.gz

Indexing...
Debug: Creating checkpoint at 0 bytes (compressed offset 16 bytes)
Progress: 16 bytes of 107 bytes (14.95%)
Index building complete; creating line index
Debug: Indexing line '1 10583 G A PASS'
Debug: Indexing line '1 10583 G A PASS'

Comma delimited (-d ,) works fine:

$ zindex -f 2 -d , --debug comma10.gz
...
Debug: Creating checkpoint at 0 bytes (compressed offset 18 bytes)
Progress: 18 bytes of 109 bytes (16.51%)
Index building complete; creating line index
Debug: Indexing line '1,10583,G,A,PASS'
Debug: Found key '10583'
Debug: Indexing line '1,10583,G,A,PASS'
Debug: Found key '10583'

mattgodbolt commented 7 years ago

Hi,

Which shell are you using? It's feasible something else is interpreting the \t. Can you try --delimiter \\t ?

I should probably add a --tab-delimiter option to make this easier :/

If that all fails you can try making a JSON description of the index you need.

jim892 commented 7 years ago

Hi Matt,

My version:

$ bash --version
GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)

Unfortunately, none of these work

Seems odd, because "\t" works fine for awk

$ zindex  -f 2 --debug --delimiter \t tab10.gz
$ zindex  -f 2 --debug --delimiter \t tab10.gz
$ zindex  -f 2 --debug --delimiter "\t" tab10.gz
$ zindex  -f 2 --debug --delimiter "\t" tab10.gz
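
A likely reason for the difference with awk: bash does not turn \t into a tab on its own, while awk interprets the escape inside its own string parser. The sketch below prints the bytes that each spelling hands to a program. It assumes bash and the standard od and tr tools; show_bytes is a hypothetical helper written for this illustration, not part of zindex.

```shell
#!/usr/bin/env bash
# show_bytes is a hypothetical helper: it prints the raw bytes of its
# first argument as hex, so you can see what a program would receive.
show_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

unquoted=$(show_bytes \t)    # bash drops the backslash, leaving the letter t
quoted=$(show_bytes "\t")    # backslash followed by t, two characters
ansi=$(show_bytes $'\t')     # ANSI-C quoting, a single tab byte

printf '%s\n' "unquoted:       $unquoted"
printf '%s\n' "double-quoted:  $quoted"
printf '%s\n' "ANSI-C quoted:  $ansi"
```

This only shows what the shell passes along. It says nothing about what zindex does with the argument once it arrives.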

So, I built a file called "tab"...

$ cat tab
{
  "indexes": [
    { "type": "field", "delimiter": "\t", "fieldNum": 2 }
  ]
}

Now this works! Finally!

$ zindex  -f 2 --debug --config tab tab10.gz
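
The config route probably works because "\t" is a standard JSON escape, so the JSON parser (rather than the shell or the option parser) produces the tab. A minimal sketch of writing such a file without the shell touching the backslash is below. It assumes a POSIX-style shell; the temporary file name and the guard around the zindex call are choices made for this illustration.

```shell
#!/usr/bin/env bash
# Write the config with a quoted heredoc ('EOF') so the shell leaves the
# backslash alone. mktemp is used here only to avoid clobbering a real file.
config_file=$(mktemp)
cat > "$config_file" <<'EOF'
{
  "indexes": [
    { "type": "field", "delimiter": "\t", "fieldNum": 2 }
  ]
}
EOF

# The file holds the two characters backslash and t, not a tab byte.
escaped_count=$(grep -c -F '\t' "$config_file" || true)
tab_count=$(grep -c "$(printf '\t')" "$config_file" || true)
printf '%s\n' "lines with a backslash-t escape: $escaped_count"
printf '%s\n' "lines with a real tab byte:      $tab_count"

# Build the index only if zindex is installed and the data file is present.
if command -v zindex >/dev/null 2>&1 && [ -f tab10.gz ]; then
  zindex --debug --config "$config_file" tab10.gz
fi
```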

Thanks for your help with this! I think the --tab-delimiter option is a GREAT idea.

Also, the --config option does not show in the --help message or in the short list of options in the USAGE section.

NEW QUESTION: (I am new to JSON but not new to MySQL.) Is there a way to combine two fields into a key? For example, I have a .gz file where the first two (out of 360) columns look like this (tab-delimited, of course):

10	12345
10	23456
10	1234567
11	2345678
X	12345
Y	123456

I'd like to retrieve lines where col1=n and col2=m.

THANKS,

Jim Perry

mattgodbolt commented 7 years ago

Hi Jim,

Your reply seems to have gotten squashed up a bit so it's a little tricky to understand, so forgive me if I miss something.

  1. Weird that '\t' is still not working. I think the command-line parsing library I use is swallowing it up as whitespace. I'm reopening this bug to track it.
  2. I'll add a --tab-delimiter option, as it seems a common-enough need.
  3. My locally-built zindex shows the option:
     -c <indexes>, --config <indexes>   Create indexes using json config file <file>

Re: your question, it's a little tricky to do that: you'd need to create two indexes, which can only be done via the JSON configuration. Querying the index would then need to be done with zq --raw.

Creating two indices should be straightforward, just use two entries in the json:

{
  "indexes": [
    {
      "type": "field",
      "delimiter": "\t",
      "fieldNum": 1,
      "name": "first"
    },
    {
      "type": "field",
      "delimiter": "\t",
      "fieldNum": 2,
      "name": "second"
    }
  ]
}

Then craft a --raw query like:

zq --raw 'select a.line from index_first a, index_second b where a.line == b.line and a.key == "aa" and b.key == "bb"' path/to/thing.gz
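
For repeated lookups, the two keys can be filled into that join from a small shell function. This is a sketch only: build_pair_query is a hypothetical helper, not part of zindex, and it does no escaping, so keys containing a double quote would break the query. The table names index_first and index_second are taken from the query above, and path/to/thing.gz is the same placeholder path.

```shell
#!/usr/bin/env bash
# build_pair_query is a hypothetical helper, not part of zindex. It fills
# two keys into the join shown above. No escaping is done on the keys.
build_pair_query() {
  first_key=$1
  second_key=$2
  printf 'select a.line from index_first a, index_second b where a.line == b.line and a.key == "%s" and b.key == "%s"' \
    "$first_key" "$second_key"
}

# Example: lines where column 1 is 10 and column 2 is 23456.
query=$(build_pair_query 10 23456)
printf '%s\n' "$query"

# Run it only if zq is installed and the (placeholder) data file exists.
data_file=path/to/thing.gz
if command -v zq >/dev/null 2>&1 && [ -f "$data_file" ]; then
  zq --raw "$query" "$data_file"
fi
```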

jim892 commented 7 years ago

Hi Matt, Thanks for the help with the two indexes and also the good example!
I cleaned up the squashed posting so you can see my example now.

Could I also use a regex to specify a single index? Maybe something like: /^.{1,2}\t\d+/

This regex covers the first 2 "fields". But I wonder whether a regex like this, which spans the first two tab-delimited fields, will work. When I retrieve, I will want the entire line, not just certain fields in the line, so the file could be treated as lines with only one "very long" field. Would that work? Can there be no delimiter?

Regards, Jim

mattgodbolt commented 7 years ago

(thanks for the edited post- much clearer now!)

You can definitely use a regex. There's no notion of 'fields' with a regex, you just define something which matches lines you're interested in, and (optionally) which part of that is the key. The whole line is always printed.

So yes, I think something like the regex you suggested should work. NB the regex engine (POSIX) I use doesn't support "\t" or "\d", so be careful with it. zindex --regex '^.{1,2}\s[0-9]+' works well in my local tests. To see what's going on, use --debug and you can see what's matching when building the index.

matthew@danger /tmp> zcat test.gz 
aa  1234    1234123131231231
bb  4444    1aasdhjkl123hjkl12123
matthew@danger /tmp> ~/dev/zindex/build/Release/zindex --regex '^.{1,2}\s[0-9]+' test.gz
matthew@danger /tmp> ~/dev/zindex/build/Release/zq test.gz aa\t1234
aa  1234    1234123131231231
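
To try a pattern like this before building an index, grep -E can stand in for a POSIX extended regex engine. This is a sketch under assumptions: grep's engine is not necessarily the one zindex links against, and the -o flag (supported by GNU and BSD grep) is used only to show the matched text. The sample lines mimic test.gz above, with real tabs between fields.

```shell
#!/usr/bin/env bash
# Two sample lines shaped like test.gz above, with real tabs between fields.
sample=$(printf 'aa\t1234\t1234123131231231\nbb\t4444\t1aasdhjkl123hjkl12123\n')

# POSIX classes stand in for the Perl-style shorthands:
#   [[:space:]] instead of \t (matches a tab, and other whitespace too)
#   [0-9]       instead of \d
pattern='^.{1,2}[[:space:]][0-9]+'

# -o prints only the matched part of each line.
matches=$(printf '%s\n' "$sample" | grep -Eo "$pattern")
match_count=$(printf '%s\n' "$sample" | grep -Ec "$pattern")
first_match=$(printf '%s\n' "$matches" | head -n 1)

printf '%s\n' "matching lines: $match_count"
printf '%s\n' "$matches"
```

Each matched string has the same shape as the aa\t1234 query passed to zq in the session above.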

jim892 commented 7 years ago

Hi Matt, I have to write and say "thanks so much" for developing zindex and for the good tutorial information above. I have it working now on a gzipped text file that is 58GB zipped (765GB unzipped). It has 220 million records with 360 columns of information in each record. I can retrieve any record, based on an index built from the first two columns (as in the example above), in ~1 second. If I feed zq a list of 250 items, I get all results back in about 30 seconds. I was initially using zgrep, which took 12 minutes per search.

More answers on how to specify "tab" for a tab-delimited file:

# The trick: put the commands with the tab character in a command file
# Or, to enter a tab character from command line:  ctrl-V then the tab key
#
# The search string needs to be in quotes as an argument to zq
#
#########################################
zcat t2.gz
#   this is tab character v (not spaces)
 zindex --regex '^.{1,2}    [0-9]+' t2.gz
#zindex --regex '^.{1,2}\s[0-9]+' t2.gz      # \s will also match tab
zq t2.gz "aa    1234" > output.txt
cat output.txt

#######################################
# Another way:
TAB=$'\t'     # define the tab character

zindex --regex "^.{1,2}${TAB}[0-9]+" t2.gz
zq t2.gz "aa${TAB}1234" > output.txt
cat output.txt

########################################

mattgodbolt commented 7 years ago

That's a great success story! Thanks for sharing :)