harelba / q

q - Run SQL directly on delimited files and multi-file sqlite databases
http://harelba.github.io/q/
GNU General Public License v3.0
10.15k stars 419 forks source link

Show lines not imported in the database #263

Open rvanlaar opened 3 years ago

rvanlaar commented 3 years ago

First of, q is a great product. It allows me to read a huge file, 5GB, without Out of Memory errors.

I'm missing around 20K lines between what wc -l gives me and select count(*) from file.csv in 'q'.

What gives, and how can I find out which lines are problematic?

harelba commented 3 years ago

Hi @rvanlaar sorry for the late response.

This is an interesting case, since q is supposed to throw an error if there's an issue with parsing/reading the data.

I wouldn't ask you to send actual data to me, but it would be great if you could narrow it down to specific sets of lines, so we can understand the issue better.

My suggestion is to use the following method:

  1. Run export MYFILE=myfile ; export T=30; export C=10; echo $(tail -$T $MYFILE | head -$C | q "select count(*) from -" -c 1) $(tail -$T $MYFILE | head -$C | wc -l). This prints the number of lines by q and by wc for some part of the file.
  2. The numbers should be the same
  3. Play with the values of T and C until there's a mismatch, and then narrow down the search.

Obviously, T=30 and C=10 are very small relative to your large file. I suggest using some binary search on the number of lines in the file in order to narrow this down as quickly as possible. If you're not familiar with binary search, please ping me here or in harelba@gmail.com and i'll gladly help.

Once you've narrowed it down to be small enough, two options:

  1. If you're comfortable with it, you can send me the data over email, and I'll figure this out.
  2. If not, then narrow it down to 2-3 lines and send them here or to me over email.

Harel

rvanlaar commented 3 years ago

Hi @harelba

I wanted to give you an update. I switched over to using pandas to do the import, which does import all the lines. So, for us it's not an urgent problem any more.

This problem still got my attention. I hope to find some time to follow your advice for the imports.

Roland