icy / google-group-crawler

[Deprecated] Get (almost) original messages from google group archives. Your data is yours.

Minor performance improvement: #7

Closed cuonglm closed 9 years ago

cuonglm commented 9 years ago

Another option I considered for improving performance is setting LC_ALL=C. But I'm not sure about the input data, so I decided to leave that for the future.

Also note that on GNU systems with UTF-8 locales, sort -u does not report unique lines, but the first of each sequence of lines that sort the same:

$ printf '%b\n' '\U2460' '\U2461' | LC_ALL=en_US.utf8 sort -u
①
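
For comparison, here is a minimal sketch of how the same two characters behave under LC_ALL=C, where sort compares raw bytes. The variable names are illustrative only, and the characters are written as octal UTF-8 bytes so the snippet does not rely on the shell's \U escape support.

```shell
# The two code points U+2460 and U+2461 as raw UTF-8 bytes (octal escapes).
input=$(printf '\342\221\240\n\342\221\241\n')

# In the C locale, sort compares bytes, so both lines are kept by -u.
unique_c=$(printf '%s\n' "$input" | LC_ALL=C sort -u)

# Count the surviving lines.
printf '%s\n' "$unique_c" | wc -l
```

The trade-off is that byte order is not the locale's collation order, so the output order can differ from what a UTF-8 locale would produce. That is the reason to check the input data before switching.
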
icy commented 9 years ago

Revise _short_url, using bash builtin string substitution

I often use sed, mostly because it's readable (especially with sed -r, when possible).
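
To make the trade-off concrete, here is a small sketch comparing a sed call with shell parameter expansion for the same job. The URL and the prefix being stripped are hypothetical and are not taken from the actual _short_url function.

```shell
# Hypothetical input, not taken from the crawler itself.
url='https://groups.google.com/d/topic/example-group/abc123'

# sed version: spawns one external process per call.
with_sed=$(printf '%s\n' "$url" | sed -e 's#^.*/d/topic/##')

# Parameter expansion: handled inside the shell, no child process.
with_expansion=${url#*/d/topic/}

printf '%s\n%s\n' "$with_sed" "$with_expansion"
```

Both forms give the same result here. The expansion avoids a fork per URL, which adds up in a loop, while the sed form is arguably easier to read.
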

Also note that on GNU systems with UTF-8 locales, sort -u does not report unique lines, but the first of each sequence of lines that sort the same:

Yah I know. Let's keep that simple, though ;)

icy commented 9 years ago

Thanks a lot for your contribution, @Gnouc !

cuonglm commented 9 years ago

I often use sed, mostly because it's readable (especially with sed -r, when possible).

You should use sed -E with GNU sed (it is undocumented there, but equivalent to sed -r). The -E option works in BSD sed too, and is going to be standardized in the next POSIX revision.
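
As a quick illustration, the sketch below runs the same substitution with and without -E. The sample string and variable names are made up for the example.

```shell
# With -E (extended regular expressions), grouping and + need no backslashes.
swapped_ere=$(printf 'abc123\n' | sed -E 's/([a-z]+)([0-9]+)/\2\1/')

# Equivalent basic regular expression, for comparison.
swapped_bre=$(printf 'abc123\n' | sed 's/\([a-z][a-z]*\)\([0-9][0-9]*\)/\2\1/')

printf '%s\n%s\n' "$swapped_ere" "$swapped_bre"
```

Both commands swap the letter run and the digit run. The -E form should behave the same on GNU sed and BSD sed, which is the portability point made above.
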

icy commented 9 years ago

Very valuable information :)