Closed (cslovell closed 5 years ago)
Try this: convert the ACHE data to jl (JSON Lines), then upload it. The data should be in CDR format, which is basically a JSON object with the following fields:
{
"url": "http://someurl.com",
"raw_content": <ACTUAL HTML CONTENT>
}
The field url is optional. Let me know if this works.
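For anyone who wants to sanity-check a jl file before uploading, here is a minimal sketch of a checker based only on the description above: one JSON object per line, raw_content present, url optional. The function name check_cdr_lines is made up for illustration, and treating both fields as strings is an assumption, not something myDIG documents here.

```python
import json


def check_cdr_lines(lines):
    """Return (line_number, problem) pairs for lines that do not look like CDR records.

    Assumptions (not confirmed by myDIG): raw_content must be a string,
    and url, when present, must be a string too.
    """
    problems = []
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines, e.g. a trailing newline
        try:
            doc = json.loads(line)
        except json.JSONDecodeError:
            problems.append((n, "not valid JSON"))
            continue
        if not isinstance(doc, dict):
            problems.append((n, "not a JSON object"))
        elif not isinstance(doc.get("raw_content"), str):
            problems.append((n, "missing or non-string raw_content"))
        elif "url" in doc and not isinstance(doc["url"], str):
            problems.append((n, "url is not a string"))
    return problems
```

Calling it on an open file, for example check_cdr_lines(open("pages.jl", encoding="utf-8")), returns an empty list when every line passes these checks.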
I'm having trouble finding the ache data. My ache.yml settings are unchanged from the github repo: data_format.type: KAFKA, topic_name: ache, format: CDR31. In my projects folder, I go to the .ache/data subdirectory, and I have my "events" folder. I was looking for a data-pages folder but the only ones there are "config data_backlinks data_hosts data_monitor data_url metrics". Any tips?
myDIG's built-in ACHE is not fully integrated; we have only tested it internally. A possible solution if you want to use ACHE: download the latest ACHE from https://github.com/ViDA-NYU/ache -> deploy it and crawl data -> convert the data to CDR format and concatenate the docs into a JSON Lines file (jl, not json) -> create a new project in myDIG and import the jl file.
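The "convert to CDR and concatenate into a jl file" step could look roughly like the sketch below. It assumes you have already pulled (url, html) pairs out of whatever ACHE wrote to disk; how you do that depends on the data format configured in ache.yml and is not covered here. The names pages_to_jl and pages.jl are hypothetical.

```python
import json


def pages_to_jl(pages, out_path):
    """Write (url, html) pairs as one CDR-style JSON object per line.

    `pages` is any iterable of (url, html) tuples; url may be None or empty,
    since the url field is optional. Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for url, html in pages:
            doc = {"raw_content": html}
            if url:
                doc["url"] = url
            # json.dumps escapes newlines inside the HTML, so each record
            # stays on a single line, which is what makes the file "jl".
            out.write(json.dumps(doc) + "\n")
            count += 1
    return count
```

For example, pages_to_jl([("http://someurl.com", "<html>...</html>")], "pages.jl") writes a one-line file. The difference from a plain .json file is that there is no enclosing array and no commas between records.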
Thanks, this is working now! Appreciate the rapid response.
I'm getting started using the USC-DIG interface and I've got it running. I'm a bit stuck on the basics. I've got the mydig ui up, I've added a topic, which I've called "events". Then I go into ache, I set up a deep crawl over a list of websites that have the events I'm interested in, which I also call "statsevents," and the crawler runs, and it's got a lot of success codes, etc. In the project directory, though, there's nothing similar to the .jl file that is uploaded in the example docs. I can't see how to get the data from ache into the mydig gui. Just curious if you've got any tips on how to get this to work. Thanks a ton.