usc-isi-i2 / dig-etl-engine

Download DIG to run on your laptop or server.
http://usc-isi-i2.github.io/dig/
MIT License

Getting data from ache to mydig gui #266

Closed by cslovell 5 years ago

cslovell commented 5 years ago

I'm getting started with the USC-DIG interface and have it running, but I'm a bit stuck on the basics. I have the mydig UI up and have added a topic, which I've called "events". In ache, I then set up a deep crawl over a list of websites that have the events I'm interested in, which I also call "statsevents," and the crawler runs and returns a lot of success codes. In the project directory, though, there's nothing similar to the .jl file that is uploaded in the example docs, and I can't see how to get the data from ache into the mydig GUI. Do you have any tips on how to get this to work? Thanks a ton.

saggu commented 5 years ago

Try this: convert the ache data to jl, then upload it. The data should be in CDR format, which is basically JSON with the following fields:

{
    "url": "http://someurl.com",
    "raw_content":  <ACTUAL HTML CONTENT>
}

The url field is optional. Let me know if this works.
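
For illustration, a minimal Python sketch of writing records in that shape, one JSON object per line. The helper names `to_cdr_line` and `write_jl` and the `(url, html)` input pairs are hypothetical, and how the HTML is obtained from the crawler is left out.

```python
import json


def to_cdr_line(raw_content, url=None):
    """Serialize one page as a single JSON line with the two fields above."""
    doc = {"raw_content": raw_content}
    if url is not None:
        # url is optional, so it is only written when known
        doc["url"] = url
    # json.dumps escapes newlines inside the HTML, so the record stays on one line
    return json.dumps(doc)


def write_jl(pages, path):
    """pages: iterable of (url, html) pairs; writes one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for url, html in pages:
            f.write(to_cdr_line(html, url) + "\n")
```

Calling `write_jl([("http://someurl.com", html)], "events.jl")` with `html` holding the page source would produce a one-line file to upload; the file name is only an example.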

cslovell commented 5 years ago

I'm having trouble finding the ache data. My ache.yml settings are unchanged from the GitHub repo: data_format.type: KAFKA, topic_name: ache, format: CDR31. In my projects folder, I go to the .ache/data subdirectory, where I have my "events" folder. I was looking for a data-pages folder, but the only ones there are "config data_backlinks data_hosts data_monitor data_url metrics". Any tips?

GreatYYX commented 5 years ago

myDIG's built-in ACHE is not fully integrated; we only tested it internally. A possible solution if you want to use ACHE: download the latest ACHE from https://github.com/ViDA-NYU/ache -> deploy it and crawl data -> convert the data to CDR format and concatenate the docs into a JSON lines file (jl, not json) -> create a new project in myDIG and import the jl file.
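
Since the distinction between jl and json is easy to trip over, a quick check before importing may help. This is a sketch, not part of myDIG: `check_jl` is a hypothetical helper, and it assumes `raw_content` is the only required field. A regular (pretty-printed or array) JSON file will show up here as invalid or non-object lines.

```python
import json


def check_jl(path):
    """Return (line_number, problem) pairs for lines that are not usable records."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are skipped by this check
            try:
                doc = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append((n, "not valid JSON: " + e.msg))
                continue
            if not isinstance(doc, dict):
                problems.append((n, "not a JSON object"))
            elif "raw_content" not in doc:
                problems.append((n, "missing raw_content"))
    return problems
```

An empty list from `check_jl("events.jl")` means every non-blank line parsed as its own JSON object with a `raw_content` field.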

cslovell commented 5 years ago

Thanks, this is working now! Appreciate the rapid response.