masayuki14 / worklog

Record working log by issues.
MIT License

Dockerfile for Embulk #4

Closed masayuki14 closed 6 years ago

masayuki14 commented 6 years ago

Create a Dockerfile that runs Embulk.

masayuki14 commented 6 years ago

https://hub.docker.com/_/java/ Use the official Java image as the base.

masayuki14 commented 6 years ago

The build fails with what looks like an out-of-disk-space error.

Step 3/10 : RUN apt-get upgrade
 ---> Running in 21b8b755d7d4
Reading package lists...
Building dependency tree...
Reading state information...
The following packages have been kept back:
  openjdk-8-jdk openjdk-8-jdk-headless openjdk-8-jre openjdk-8-jre-headless
The following packages will be upgraded:
  base-files bzr ca-certificates curl debconf debconf-i18n
  debian-archive-keyring git git-man gnupg gpgv libc-bin libc6 libcups2
  libcurl3 libcurl3-gnutls libdb5.3 libexpat1 libffi6 libfreetype6 libgcrypt20
  libgdk-pixbuf2.0-0 libgdk-pixbuf2.0-common libgnutls-deb0-28 libgraphite2-3
  libgssapi-krb5-2 libgtk2.0-0 libgtk2.0-bin libgtk2.0-common libicu52
  libjasper1 libk5crypto3 libkrb5-3 libkrb5support0 liblcms2-2 libldap-2.4-2
  libncurses5 libncursesw5 libnss3 libpam-modules libpam-modules-bin libpam0g
  librtmp1 libssl1.0.0 libsvn1 libsystemd0 libtasn1-6 libtiff5 libtinfo5
  libudev1 libx11-6 libx11-data libx11-dev libx11-doc libx11-xcb1 libxcursor1
  libxfixes3 libxi6 libxml2 libxrandr2 libxtst6 login mercurial
  mercurial-common multiarch-support ncurses-base ncurses-bin openssh-client
  openssl passwd perl perl-base perl-modules python-bzrlib sensible-utils
  subversion systemd systemd-sysv tzdata udev unzip wget
82 upgraded, 0 newly installed, 0 to remove and 4 not upgraded.
Need to get 59.3 MB of archives.
After this operation, 2279 kB disk space will be freed.
Do you want to continue? [Y/n] Abort.
The command '/bin/sh -c apt-get upgrade' returned a non-zero code: 1
masayuki14 commented 6 years ago

Docker Preference > Disk > [Resize disk image]

The disk image is 64GB with about 17GB allocated, so there should be enough space. Changing it from 64 to 96 made no difference.

masayuki14 commented 6 years ago

For now, delete all existing images.
$ docker images -q | xargs -I_ docker rmi _

Deleting them made no difference.

masayuki14 commented 6 years ago
RUN apt-get -y upgrade

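Adding the -y option made it work. The earlier log shows apt-get aborting at the "Do you want to continue? [Y/n]" prompt, which -y answers automatically, so disk space was not the problem. The build sometimes succeeds without -y, so it may depend on the base image. I'll always add it from now on.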

masayuki14 commented 6 years ago

Embulk now runs in Docker, so try it with JSON input.

Run: $ embulk example ./example

masayuki14 commented 6 years ago

Edit seed.yml.

in:
  type: file
  path_prefix: '/work/./example/json/tripadvisor_'
out:
  type: stdout

Copy the TripAdvisor JSON data into example/json/.

masayuki14 commented 6 years ago

Run the guess command to create config.yml.

$ embulk guess example/seed.yml -o config.yml
2018-02-13 02:31:59.138 +0000: Embulk v0.9.2

********************************** INFORMATION **********************************
  Join us! Embulk-announce mailing list is up for IMPORTANT announcement such as
    compatibility-breaking changes and key feature updates.
  https://groups.google.com/forum/#!forum/embulk-announce
*********************************************************************************

2018-02-13 02:32:02.560 +0000 [INFO] (main): Gem's home and path are set by default: "/root/.embulk/lib/gems"
2018-02-13 02:32:03.377 +0000 [INFO] (main): Started Embulk v0.9.2
2018-02-13 02:32:03.443 +0000 [INFO] (0001:guess): Listing local files at directory '/work/example/json' filtering filename by prefix 'tripadvisor_'
2018-02-13 02:32:03.445 +0000 [INFO] (0001:guess): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2018-02-13 02:32:03.454 +0000 [INFO] (0001:guess): Loading files [/work/example/json/tripadvisor_uji_things_to_do_20180209.json]
2018-02-13 02:32:03.475 +0000 [INFO] (0001:guess): Try to read 32,768 bytes from input source
2018-02-13 02:32:03.553 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.2)
2018-02-13 02:32:03.568 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.2)
2018-02-13 02:32:03.588 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.2)
2018-02-13 02:32:03.615 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.2)
in:
  type: file
  path_prefix: /work/./example/json/tripadvisor_
  parser: {charset: UTF-8, newline: LF}
out: {type: stdout}

Created 'config.yml' file.
masayuki14 commented 6 years ago

Because the input was type: file, the guessed config barely changed. Maybe type: json would work?

masayuki14 commented 6 years ago

Changing it to type: json produced an error. Need to look into this properly.

Error: InputPlugin 'json' is not found.
Unknown input plugin 'json'. embulk/input/json.rb is not installed. Run 'embulk gem search -rd embulk-input' command to find plugins.
masayuki14 commented 6 years ago

https://takeshiyako.blogspot.jp/2015/04/embulk-json-google-bigquery.html
https://qiita.com/shun0102/items/8989e6ed2ee0f46a0fa9

Apparently embulk-parser-jsonl is the plugin to use.

masayuki14 commented 6 years ago

http://www.embulk.org/plugins/ Better to check the official site after all.

masayuki14 commented 6 years ago

jsonl is listed under FILE PARSER.

masayuki14 commented 6 years ago

$ embulk gem install embulk-parser-jsonl

masayuki14 commented 6 years ago
$ embulk guess -g jsonl example/seed.yml -o config.yml
2018-02-13 02:45:52.092 +0000: Embulk v0.9.2

********************************** INFORMATION **********************************
  Join us! Embulk-announce mailing list is up for IMPORTANT announcement such as
    compatibility-breaking changes and key feature updates.
  https://groups.google.com/forum/#!forum/embulk-announce
*********************************************************************************

2018-02-13 02:45:54.356 +0000 [INFO] (main): Gem's home and path are set by default: "/root/.embulk/lib/gems"
2018-02-13 02:45:55.125 +0000 [INFO] (main): Started Embulk v0.9.2
2018-02-13 02:45:55.185 +0000 [INFO] (0001:guess): Listing local files at directory '/work/example/json' filtering filename by prefix 'tripadvisor_'
2018-02-13 02:45:55.187 +0000 [INFO] (0001:guess): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2018-02-13 02:45:55.201 +0000 [INFO] (0001:guess): Loading files [/work/example/json/tripadvisor_uji_things_to_do_20180209.json]
2018-02-13 02:45:55.225 +0000 [INFO] (0001:guess): Try to read 32,768 bytes from input source
2018-02-13 02:45:55.305 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.2)
2018-02-13 02:45:55.330 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.2)
2018-02-13 02:45:55.365 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.2)
2018-02-13 02:45:55.626 +0000 [INFO] (0001:guess): Loaded plugin embulk (0.9.2)
2018-02-13 02:45:55.679 +0000 [INFO] (0001:guess): Loaded plugin embulk-parser-jsonl (0.2.0)
org.jruby.exceptions.RaiseException: (ParserError) A JSON text must at least contain two octets!
    at json.ext.Parser.initialize(json/ext/Parser.java:175)
    at json.ext.Parser.new(json/ext/Parser.java:151)
    at RUBY.parse(uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/json/common.rb:155)
    at RUBY.block in guess_lines(/root/.embulk/lib/gems/gems/embulk-parser-jsonl-0.2.0/lib/embulk/guess/jsonl.rb:18)
    at org.jruby.RubyArray.each(org/jruby/RubyArray.java:1735)
    at RUBY.guess_lines(/root/.embulk/lib/gems/gems/embulk-parser-jsonl-0.2.0/lib/embulk/guess/jsonl.rb:17)
    at RUBY.guess(uri:classloader:/gems/embulk-0.9.2-java/lib/embulk/guess_plugin.rb:121)
    at RUBY.guess(uri:classloader:/gems/embulk-0.9.2-java/lib/embulk/guess_plugin.rb:24)

Error: (ParserError) A JSON text must at least contain two octets!
masayuki14 commented 6 years ago

Maybe there is no need to insist on using guess.

masayuki14 commented 6 years ago

It may be failing because the JSON is not line-delimited data.

[
{ ... },
{ ... }
]
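
A line-oriented parser such as embulk-parser-jsonl expects one JSON object per line, so a pretty-printed array whose first line is just "[" would plausibly trigger the "two octets" error above. One possible workaround, sketched here in Python under the assumption that the file holds a single top-level array of objects, is to rewrite the data as line-delimited JSON. The function name and the second sample record are hypothetical and not part of the original data.

```python
import json


def json_array_to_jsonl(text: str) -> str:
    """Convert a JSON array of objects into line-delimited JSON (one object per line)."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps never emits raw newlines, so each record becomes exactly one line.
    return "".join(json.dumps(record, ensure_ascii=False) + "\n" for record in records)


# Hypothetical sample shaped like the array sketched above.
sample = """[
  {"title": "Sawarabi Street", "review": 32},
  {"title": "Example Place", "review": 5}
]"""

jsonl = json_array_to_jsonl(sample)
print(jsonl, end="")
```

Writing the result to a new file under example/json/ would then give a line-oriented parser one record per line.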
masayuki14 commented 6 years ago

$ embulk gem install embulk-parser-json
Try the regular JSON parser instead.

masayuki14 commented 6 years ago

https://github.com/takumakanari/embulk-parser-json Write config.yml by hand, using the Example there as a reference.

in:
  type: file
  path_prefix: /work/./example/json/tripadvisor_
  parser:
    type: jsonpath
    root: $
    stop_on_invalid_record: false
    schema:
      - { name: detail_url,       type: string }
      - { name: title,            type: string }
      - { name: rate,             type: string }
      - { name: review,           type: long }
      - { name: part,             type: string }
      - { name: tags,             type: string }
      - { name: rating5,          type: long }
      - { name: rating4,          type: long }
      - { name: rating3,          type: long }
      - { name: rating2,          type: long }
      - { name: rating1,          type: long }
      - { name: street_address,   type: string }
      - { name: address_locality, type: string }
      - { name: postal_code,      type: string }
      - { name: place_id_g,       type: string }
      - { name: place_id_d,       type: string }
      - { name: lng,              type: double }
      - { name: lat,              type: double }
      - { name: images,           type: string, path: "images[0]" }

out: {type: stdout}
masayuki14 commented 6 years ago

Dry run. It worked.

type: integer and type: int were rejected, but type: long was accepted. Are the types allowed here Java types? Where is the documentation?

$ embulk preview config.yml
2018-02-13 04:54:56.513 +0000: Embulk v0.9.2

********************************** INFORMATION **********************************
  Join us! Embulk-announce mailing list is up for IMPORTANT announcement such as
    compatibility-breaking changes and key feature updates.
  https://groups.google.com/forum/#!forum/embulk-announce
*********************************************************************************

2018-02-13 04:55:00.411 +0000 [INFO] (main): Gem's home and path are set by default: "/root/.embulk/lib/gems"
2018-02-13 04:55:01.312 +0000 [INFO] (main): Started Embulk v0.9.2
2018-02-13 04:55:01.404 +0000 [INFO] (0001:preview): Listing local files at directory '/work/example/json' filtering filename by prefix 'tripadvisor_'
2018-02-13 04:55:01.408 +0000 [INFO] (0001:preview): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2018-02-13 04:55:01.431 +0000 [INFO] (0001:preview): Loading files [/work/example/json/tripadvisor_uji.json, /work/example/json/tripadvisor_uji.json~, /work/example/json/tripadvisor_uji.json.jq]
2018-02-13 04:55:01.456 +0000 [INFO] (0001:preview): Try to read 32,768 bytes from input source
2018-02-13 04:55:01.855 +0000 [INFO] (0001:preview): Loaded plugin embulk-parser-json (0.0.7)
2018-02-13 04:55:01.870 +0000 [WARN] (0001:preview): 'embulk-parser-json' has been deprecated.
2018-02-13 04:55:01.870 +0000 [WARN] (0001:preview): Just use 'embulk-parser-jsonpath' (https://rubygems.org/gems/embulk-parser-jsonpath) instead.
+------------------------------------------------------------------------------------------------------------------------+-----------------+-------------+-------------+-------------+--------------------------------------------------------------------------+--------------+--------------+--------------+--------------+--------------+-----------------------+-------------------------+--------------------+-------------------+-------------------+------------+------------+---------------+
|                                                                                                      detail_url:string |    title:string | rate:string | review:long | part:string |                                                              tags:string | rating5:long | rating4:long | rating3:long | rating2:long | rating1:long | street_address:string | address_locality:string | postal_code:string | place_id_g:string | place_id_d:string | lng:double | lat:double | images:string |
+------------------------------------------------------------------------------------------------------------------------+-----------------+-------------+-------------+-------------+--------------------------------------------------------------------------+--------------+--------------+--------------+--------------+--------------+-----------------------+-------------------------+--------------------+-------------------+-------------------+------------+------------+---------------+
| https://www.tripadvisor.com/Attraction_Review-g946495-d1867744-Reviews-Sawarabi_Street-Uji_Kyoto_Prefecture_Kinki.html | Sawarabi Street |        4.0  |          32 |             | Points of Interest & Landmarks,Historic Walking Areas,Sights & Landmarks |            5 |           15 |           12 |            0 |            0 |                   Uji |                    Uji, |                    |           g946495 |          d1867744 |  135.78813 |   34.88997 |               |
+------------------------------------------------------------------------------------------------------------------------+-----------------+-------------+-------------+-------------+--------------------------------------------------------------------------+--------------+--------------+--------------+--------------+--------------+-----------------------+-------------------------+--------------------+-------------------+-------------------+------------+------------+---------------+
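
On the question of which values type: accepts: Embulk defines its own column types (boolean, long, double, string, timestamp, json) rather than using Java primitive names, which would explain why integer and int were rejected. The following Python sketch checks a schema against that list before running Embulk. It assumes the parser plugin accepts exactly these core types, which has not been verified for embulk-parser-json, and the helper name invalid_types is hypothetical.

```python
# Embulk's core column types; assumed here to be what the parser's schema accepts.
EMBULK_TYPES = {"boolean", "long", "double", "string", "timestamp", "json"}


def invalid_types(schema):
    """Return (name, type) pairs whose type is not an Embulk column type."""
    return [(col["name"], col["type"]) for col in schema if col["type"] not in EMBULK_TYPES]


schema = [
    {"name": "title", "type": "string"},
    {"name": "review", "type": "integer"},  # not an Embulk type; long is the integer type
    {"name": "rating5", "type": "int"},     # not an Embulk type; long is the integer type
    {"name": "lng", "type": "double"},
]

print(invalid_types(schema))
```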
masayuki14 commented 6 years ago

The images:string column is coming back empty, so investigate.

masayuki14 commented 6 years ago

There was a typo. The image path was written as imgaes.

masayuki14 commented 6 years ago

After fixing it, the URL was retrieved correctly.
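
A typo like this is easy to miss because the column simply comes back empty, with no error. As a rough illustration, the Python sketch below compares each schema column's source key against the keys of one sample record. It assumes a column reads from its path when one is given and from its name otherwise, and it only checks the top-level key. The helper name, the sample record and the example.com URL are hypothetical.

```python
import re


def missing_fields(schema, record):
    """Return names of schema columns whose top-level source key is absent from the record."""
    missing = []
    for col in schema:
        source = col.get("path", col["name"])
        # Keep only the leading key, e.g. "images[0]" -> "images".
        key = re.split(r"[.\[]", source, maxsplit=1)[0]
        if key not in record:
            missing.append(col["name"])
    return missing


record = {"title": "Sawarabi Street", "images": ["https://example.com/a.jpg"]}
schema = [
    {"name": "title", "type": "string"},
    {"name": "images", "type": "string", "path": "imgaes[0]"},  # the typo described above
]

print(missing_fields(schema, record))
```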