Closed GoogleCodeExporter closed 9 years ago
That's a good point. I looked into this when I started, but it had a number of
drawbacks. I'll revisit it when I get a chance. If I can find a straightforward
way of doing this, maybe with a "--turbo" flag, I'll incorporate it. I'm
definitely open to input from the open source community on this as well.
Original comment by tobyro...@gmail.com
on 12 Mar 2012 at 10:52
Hi, please try this beta and let me know how it goes. Use --quick_output for
large files.
Original comment by tobyro...@gmail.com
on 15 Mar 2012 at 11:21
I've tested it with commands like
cat big.file | pyp --quick_output "t[0]+'\t'+t[1]+'\t\t'+t[2]"
and it works as expected. Thanks!
Original comment by neatn...@gmail.com
on 17 Mar 2012 at 3:56
I found a couple of cases for which passing -q leads to some missing lines of
output:
$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp -q "rel(r'^\d [23]')"
1 1
$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp "(int(w[1]) not in {2,3})"
1 1
4 4
5 5
$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp -q "(int(w[1]) not in {2,3})"
1 1
$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp "rel(r'^\d [23]')"
1 1
4 4
5 5
Original comment by neatn...@gmail.com
on 3 Apr 2012 at 8:23
OK, thanks for the update on the beta. I'll look into this.
Original comment by tobyro...@gmail.com
on 3 Apr 2012 at 10:04
I *think* the issue is that Pyp.n is not being incremented, so safe_eval() does
nothing after the first line that evaluates to False. But I don't understand
the code well enough to know how to fix it.
Original comment by neatn...@gmail.com
on 4 Apr 2012 at 3:52
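The symptom described above (no output after the first line that evaluates to False) can be reproduced with a toy model. This is a hypothetical sketch, not pyp's actual code: the class LineFilter, its attributes, and the increment_on_false switch are all invented for illustration. It assumes only that the evaluator skips any line index it believes it has already handled.

```python
class LineFilter:
    """Toy streaming filter. NOT pyp's implementation; all names are hypothetical."""

    def __init__(self, predicate, increment_on_false=True):
        self.predicate = predicate
        self.increment_on_false = increment_on_false
        self.n = 0              # index of the line currently being processed
        self.finished = set()   # line indices that were already evaluated

    def process(self, line):
        """Return the line if it passes the predicate, otherwise None."""
        if self.n in self.finished:
            return None  # evaluator thinks this index was already handled
        self.finished.add(self.n)
        keep = self.predicate(line)
        # The assumed bug: the counter only advances when a line is kept.
        if keep or self.increment_on_false:
            self.n += 1
        return line if keep else None


def run(lines, **kwargs):
    """Filter lines whose second word is not 2 or 3."""
    f = LineFilter(lambda s: int(s.split()[1]) not in {2, 3}, **kwargs)
    return [out for out in map(f.process, lines) if out is not None]


lines = ["%d %d" % (i, i) for i in range(1, 6)]
print(run(lines))                            # ['1 1', '4 4', '5 5']
print(run(lines, increment_on_false=False))  # ['1 1']
```

With the counter stuck at the first rejected line, every later line is treated as already evaluated, which matches the truncated output reported with -q.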
OK, please try this one. Curly brackets don't work for me, but the command is
essentially the same. Let me know if it works. This is the latest beta, which
should also deal with unintentional stripping. It's still a beta though, so
please let me know if you see any weirdness.
for i in 1 2 3 4 5; do echo "$i $i"; done | pyp_beta_2.11.5.py -q "(int(w[1]) not in [2,3])"
1 1
4 4
5 5
Original comment by tobyro...@gmail.com
on 14 Apr 2012 at 8:19
The new pyp_beta should fix this:
http://code.google.com/p/pyp/downloads/detail?name=pyp_beta&can=2&q=#makechanges
Original comment by tobyro...@gmail.com
on 16 May 2012 at 9:37
It seems that the new pyp_beta (2.11.23) does not include the --quick_output
feature, right?
Original comment by apte...@gmail.com
on 20 May 2012 at 3:13
Quick output mode is now on by default unless you use one of the list
operators (pp, spp, fpp), so we removed the flag. Cheers, t
Original comment by tobyro...@gmail.com
on 20 May 2012 at 5:07
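The distinction between per-line expressions and list operators can be sketched in plain Python. This is an illustrative sketch only, not pyp's switching routine: the function names stream and buffered are invented here, and the assumption is simply that a per-line expression needs one line at a time while a list operation needs the complete input before it can produce anything.

```python
import io


def stream(source, per_line):
    """Apply per_line lazily; only one input line is held at a time."""
    for line in source:
        yield per_line(line.rstrip("\n"))


def buffered(source, whole_list):
    """Read every line first, then apply whole_list to the complete list."""
    lines = [line.rstrip("\n") for line in source]
    return whole_list(lines)


data = "3 c\n1 a\n2 b\n"

# Per-line work (first word of each line) can start printing immediately.
first_words = list(stream(io.StringIO(data), lambda s: s.split()[0]))

# Whole-input work (sorting) cannot emit anything until the input ends.
sorted_lines = buffered(io.StringIO(data), sorted)

print(first_words)   # ['3', '1', '2']
print(sorted_lines)  # ['1 a', '2 b', '3 c']
```

A generator keeps memory flat for large files, which is why streaming is a reasonable default whenever the expression does not refer to the whole input.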
[deleted comment]
That's weird. You should see immediate output without the redirection. Is that
your exact command? Make sure you are running pyp_beta. Let me know if the
older version with the flag runs faster.
Thanks,
T
Original comment by tobyro...@gmail.com
on 20 May 2012 at 6:31
Hi, are you still seeing these issues with pyp_beta?
Thanks, t
Original comment by tobyro...@gmail.com
on 22 May 2012 at 1:19
Sorry for the late response. I tested pyp_beta_2.11.1 and 2.11.5 using the
--quick_output option, and both worked on a large file without first loading
the file.
Original comment by apte...@gmail.com
on 22 May 2012 at 2:51
OK, thanks for checking that out. We'll look into this; that's a key feature.
t
Original comment by tobyro...@gmail.com
on 22 May 2012 at 3:45
Hi, I think I've found the problem. Could you try this version and let me know
how it goes?
Thanks again for your help. It's a good suggestion, and I think pyp feels more
responsive when running simple commands. It has a fairly complex switching
routine, so it's taking a while to iron out the bugs.
t
Original comment by tobyro...@gmail.com
on 2 Jun 2012 at 4:47
With the new pyp_beta, I can get output without loading the whole file into
memory. However, pyp_beta seems quite slow when processing large files.
I tested the performance of awk and pyp using the following simple example.
The file I used (article_categories_en.nt, around 2 GB) was downloaded from
DBpedia and contains about ten million lines.
/usr/bin/time -o awk.time cat article_categories_en.nt | awk '{print $1,$3}' > test.awk
/usr/bin/time -o pyp.time cat article_categories_en.nt | ./pyp_beta 'w[1],w[3]' > test.pyp
I am not sure I did this right (I am new to both pyp and awk). With the
above commands, awk takes around 13 s to produce the output file, which is
around 1.5 GB. With pyp_beta, half an hour has passed and it is still running,
having produced only about 30 MB of output.
Though this case is too simple to show the true power of pyp, the
performance issue is really annoying.
Original comment by apte...@gmail.com
on 2 Jun 2012 at 7:22
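For reference, the column selection being benchmarked can be written as a plain streaming loop in Python. This is a minimal sketch with invented function names (select_fields, process), not a pyp internal and not a performance claim. One caveat on the comparison itself: awk fields are 1-based, while the t[0] example earlier in this thread suggests pyp's split variables use Python's zero-based indexing, in which case 'w[1],w[3]' would select the second and fourth columns rather than the first and third. The sketch uses indices (0, 2) as the counterpart of awk's $1,$3.

```python
import io


def select_fields(line, indices):
    """Join the whitespace-separated fields at the given zero-based indices.

    Missing fields are skipped here; awk would print them as empty strings.
    """
    words = line.split()
    return " ".join(words[i] for i in indices if i < len(words))


def process(source, sink, indices=(0, 2)):
    """Stream source to sink one line at a time, like awk '{print $1,$3}'."""
    for line in source:
        sink.write(select_fields(line, indices) + "\n")


# Small in-memory demo; for a real file, pass sys.stdin and sys.stdout instead.
out = io.StringIO()
process(io.StringIO("a b c .\nd e f .\n"), out)
print(out.getvalue(), end="")
```

Timing a loop like this on the same file would give a rough idea of how much of the gap comes from the Python interpreter itself and how much from pyp's per-line evaluation.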
I assume you see this level of performance with the earlier pyp betas as well.
Unfortunately, I think we'll have to go to fully compiled code to get the level
of performance you need. Most users of pyp are working on much smaller data
sets. Thanks for your help testing. Hopefully we will get this compiled at
some point.
Original comment by tobyro...@gmail.com
on 6 Jun 2012 at 6:05
Original issue reported on code.google.com by
neatn...@gmail.com
on 4 Mar 2012 at 2:42