matthiasn / BirdWatch

Tweet stream analysis and visualization with real-time updates.
http://matthiasnehlsen.com/
GNU Affero General Public License v3.0
856 stars, 153 forks

How can one enable non-Latin (e.g. Cyrillic) support in the BirdWatch Clojure application? #31

Closed: maxicusj closed 9 years ago

maxicusj commented 9 years ago

Hi,

What I see is that if I use non-Latin characters in twitterconf.edn (e.g. to track Україна or Россия (Ukraine, Russia)), these track words are simply ignored when the encoding of twitterconf.edn is set to Windows-1252. If I change the encoding of the config file to UTF-8, the TwitterClient crashes with the error below. Do you have an idea what is wrong? How can one enable non-Latin (e.g. Cyrillic) support in the BirdWatch Clojure application?

MfG, Roman

WARNING: update already refers to: #'clojure.core/update in namespace: clj-http.client, being replaced by: #'clj-http.client/update
Exception in thread "main" java.lang.IllegalArgumentException: No implementation of method: :make-writer of protocol: #'clojure.java.io/IOFactory found for class: nil, compiling:(C:\Users\Administrator\AppData\Local\Temp\2\form-init46985096215767084.clj:1:116)
    at clojure.lang.Compiler.load(Compiler.java:7206)
    at clojure.lang.Compiler.loadFile(Compiler.java:7150)
    at clojure.main$load_script.invoke(main.clj:274)
    at clojure.main$init_opt.invoke(main.clj:279)
    at clojure.main$initialize.invoke(main.clj:307)
    at clojure.main$null_opt.invoke(main.clj:342)
    at clojure.main$main.doInvoke(main.clj:420)
    at clojure.lang.RestFn.invoke(RestFn.java:421)
    at clojure.lang.Var.invoke(Var.java:383)
    at clojure.lang.AFn.applyToHelper(AFn.java:156)
    at clojure.lang.Var.applyTo(Var.java:700)
    at clojure.main.main(main.java:37)
Caused by: java.lang.IllegalArgumentException: No implementation of method: :make-writer of protocol: #'clojure.java.io/IOFactory found for class: nil
    at clojure.core$_cache_protocol_fn.invoke(core_deftype.clj:555)
    at clojure.java.io$fn6825$G6779__6832.invoke(io.clj:69)
    at clojure.java.io$writer.doInvoke(io.clj:119)
    at clojure.lang.RestFn.invoke(RestFn.java:410)
    at clojure.lang.AFn.applyToHelper(AFn.java:154)
    at clojure.lang.RestFn.applyTo(RestFn.java:132)
    at clojure.core$apply.invoke(core.clj:628)
    at clojure.core$spit.doInvoke(core.clj:6661)
    at clojure.lang.RestFn.invoke(RestFn.java:425)
    at clj_pid.core$save.invoke(core.clj:16)
    at birdwatch_tc.main$_main.doInvoke(main.clj:39)
    at clojure.lang.RestFn.invoke(RestFn.java:397)
    at clojure.lang.Var.invoke(Var.java:375)
    at user$eval5.invoke(form-init46985096215767084.clj:1)
    at clojure.lang.Compiler.eval(Compiler.java:6767)
    at clojure.lang.Compiler.eval(Compiler.java:6757)
    at clojure.lang.Compiler.load(Compiler.java:7194)
    ... 11 more
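As background to the encoding observation in the question above: Windows-1252 has no Cyrillic characters at all, so a track string like the one discussed here cannot survive being saved in that encoding. The sketch below is plain Python used purely for illustration, not BirdWatch code, and it does not claim to reproduce what happened on the reporter's machine. It only shows that the string round-trips through UTF-8 but not through Windows-1252 (`cp1252` in Python).

```python
# Illustration only: plain Python, not part of BirdWatch.
track = "Україна,Россия"  # the :track value quoted in this thread

# UTF-8 can represent every character, so the round trip is lossless.
utf8_bytes = track.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# Windows-1252 (cp1252) contains no Cyrillic letters, so a strict
# encode fails outright.
try:
    track.encode("cp1252")
    representable_in_cp1252 = True
except UnicodeEncodeError:
    representable_in_cp1252 = False

# A lenient encode silently substitutes "?" for each Cyrillic letter,
# which is one way keywords can end up unusable without any error.
lossy = track.encode("cp1252", errors="replace")

print(round_trip == track)
print(representable_in_cp1252)
print(lossy)
```

Whether an editor fails loudly or substitutes placeholder characters when saving depends on the editor, so the lenient case is shown only as one possible outcome.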

matthiasn commented 9 years ago

Hi Roman, I'm not sure what the issue could be; it works fine over here. I've changed the configuration files. In twitterconf.edn:

:es-index                 "russia-ukraine"
:track                    "Україна,Россия"

Also conf.edn:

:es-index         "russia-ukraine"

It worked right away:

[Screenshot: cyrillic]

EDN files are always UTF-8 according to the documentation, which is what I thought, so there is no need to do any format conversions.

Did your local installation work before trying the cyrillic words?

Cheers, Matthias

maxicusj commented 9 years ago

Hi,

Yes, the installation works when using Latin characters (e.g. Russia, Ukraine). The .edn file I downloaded with the source is reported as UNIX, Win-1252 encoding, not UTF-8. I checked with UltraEdit.

I have a feeling it may be an OS issue. I am running the installation on Windows Server 2008 R2. I assume you are on Unix?
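One point that may explain the editor's report: a file containing only ASCII characters is byte-for-byte identical in UTF-8 and Windows-1252, so an editor has nothing to distinguish them by and may simply label it with its default encoding. The following Python sketch (illustrative only; `describe_encoding` is a hypothetical helper name, not something from BirdWatch or UltraEdit) classifies raw bytes so a config file can be checked independently of any editor.

```python
import codecs


def describe_encoding(data: bytes) -> str:
    """Classify raw file bytes. Hypothetical helper for illustration."""
    if data.startswith(codecs.BOM_UTF8):
        # A byte order mark at the start; some tools on Windows add one.
        return "utf-8-bom"
    if all(b < 0x80 for b in data):
        # Pure ASCII: identical bytes in UTF-8 and Windows-1252.
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "not-utf-8"


# An ASCII-only config line is ambiguous between the two encodings.
print(describe_encoding(b':track "Russia,Ukraine"'))
# The same line with Cyrillic keywords, saved as UTF-8.
print(describe_encoding(':track "Україна,Россия"'.encode("utf-8")))
```

To check a real file, the bytes could be read with `open(path, "rb").read()` and passed to the helper; the path is whatever the local twitterconf.edn location happens to be.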

matthiasn commented 9 years ago

That's odd, as it should be UTF-8 either way. I'm mostly on Linux these days, but I actually tried this on OS X, using TextMate to save the file with the Cyrillic characters. I can try again on Linux with IDEA tomorrow.

maxicusj commented 9 years ago

Hi, I tried it again on OS X and it seems to work. It looks like the issue was a comma in Cyrillic. What bothers me, though, is how few tweets I get for a Cyrillic keyword, e.g. Россия (Russia). It seems like only hashtags are parsed and not the content? Many thanks!
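The comma remark above suggests a quick sanity check on the track string before starting the client. The Python sketch below is an assumption-laden illustration, not BirdWatch code: it assumes the track value is a list of keywords separated by the ASCII comma (as in the "Україна,Россия" example earlier in the thread), and the set of look-alike characters is a hypothetical selection chosen for the example, not a list taken from Twitter's documentation.

```python
import unicodedata

# Hypothetical set of characters that can be mistaken for an ASCII comma.
SUSPECT_SEPARATORS = {"\uff0c", "\u060c", "\u3001", "\u201a"}


def split_track(track: str) -> list:
    """Split a track string on ASCII commas, dropping empty entries."""
    return [term.strip() for term in track.split(",") if term.strip()]


def suspicious_separators(track: str) -> list:
    """Return (index, Unicode name) for each comma look-alike found."""
    return [
        (i, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(track)
        if ch in SUSPECT_SEPARATORS
    ]


good = "Україна,Россия"
bad = "Україна\uff0cРоссия"  # fullwidth comma instead of ASCII comma

print(split_track(good), suspicious_separators(good))
print(split_track(bad), suspicious_separators(bad))
```

With a look-alike separator, the whole string is treated as a single keyword, which would match very few tweets, if any. This does not address the separate question of whether only hashtags are matched.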