stucco / docs

Documentation and Issue Tracking for Stucco
https://stucco.github.io/

Why is there no data in postgres? #6

Closed jtyoui closed 6 years ago

jtyoui commented 6 years ago

image

mikeiannacone commented 6 years ago

Hm, there are a lot of potential reasons why this could be happening, so I'm going to need significantly more information.

How did you set up your instance? From the dev-setup repo, or by installing each component manually?

Are your rt instances running? Are there any stucco-rt.*.log files being generated in the /home/stucco/rt directory (or wherever expected, based on your install?)

Are the collector instance(s) and rabbitmq running? Are there any stucco-scheduler-*.log files in the /var/log/supervisor/ dir? (This assumes you're using supervisord, as in the default install.)
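The two log checks above can be scripted. This is a minimal sketch, not part of Stucco: `find_logs` is a hypothetical helper, and the glob patterns are assumptions based on the file names mentioned in this comment, so adjust both the directories and the patterns to your install.

```python
import glob
import os


def find_logs(directory, pattern):
    """Return sorted (path, size_in_bytes) pairs for files matching pattern."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    return [(path, os.path.getsize(path)) for path in paths]


if __name__ == "__main__":
    # Locations taken from the questions above; the patterns are assumptions.
    checks = [
        ("/home/stucco/rt", "stucco-rt*.log"),
        ("/var/log/supervisor", "stucco-scheduler-*.log"),
    ]
    for directory, pattern in checks:
        found = find_logs(directory, pattern)
        if not found:
            print(f"no {pattern} files in {directory}")
        for path, size in found:
            print(f"{path}: {size} bytes")
```

An empty result for a directory suggests the corresponding process never started, or that it logs somewhere else.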

jtyoui commented 6 years ago

I installed from the source code. I now have PostgreSQL data, but there is no unstructured data. In rabbitmq, there is no unstructured message queue. I looked at your collector YAML configuration file, which did not download any unstructured data. How can I download unstructured data to extract relations?

jtyoui commented 6 years ago

stucco-rt-error.log contains no errors. Everything looks normal.

jtyoui commented 6 years ago

Dear sir, I have integrated your project into a Maven project. Please see: https://github.com/jtyoui/stucco

mikeiannacone commented 6 years ago

Good to hear!

There are separate rt processes (with different jar files) for structured and unstructured sources. Check your process list to see if there is an rt-unstructured.jar process running.

The rt workers initialize the rabbitmq queues, so if they aren't started, nothing will create the queues.

The collector config will also need to be changed to collect those unstructured sources; by default it will use /home/stucco/collectors/config/collectors.yml. Note that there are some entity & relation types that can be generated from structured sources but not from unstructured sources, mostly because we didn't have enough training data for all types. I forget the complete list (@testak may remember?) but for your testing I would start with something simple like IP addresses, and continue from there.

jtyoui commented 6 years ago

rabbitmq only has structured data. The config file /home/stucco/collectors/config/collectors.yml contains no unstructured data sources. Why?

jtyoui commented 6 years ago

rt-unstructured.jar reports no errors; everything looks normal.

jtyoui commented 6 years ago

image

mikeiannacone commented 6 years ago

So it sounds like:

a) rt-unstructured is running, and nothing unusual is showing up in any of the stucco*.log files in ~stucco/rt/, or the log files in /var/log/supervisor/

b) rabbitmqctl list_queues is showing both a stucco-in-structured and stucco-in-unstructured queue, and the unstructured queue has no messages?

It sounds like those are both true based on your description, but I wanted to confirm that before proceeding. (For example, the null value of response that you found is expected when the queue is empty. This is why it will re-check after sleepTime has passed, assuming the persistent flag is true.)
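To illustrate why a null response is harmless, here is a minimal sketch of that kind of polling loop. It is not the actual rt worker code: `consume`, `fetch`, `handle`, and `max_polls` are hypothetical names, and only `sleepTime` and the persistent flag come from the description above.

```python
import time


def consume(fetch, handle, sleep_time=1.0, persistent=True,
            max_polls=None, sleep=time.sleep):
    """Poll fetch() repeatedly; a None response means the queue is empty."""
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        response = fetch()
        if response is not None:
            handle(response)   # got a message: process it and poll again
            continue
        if not persistent:
            break              # empty queue and not persistent: stop
        sleep(sleep_time)      # empty queue: wait, then re-check
    return polls
```

With `persistent=True` the loop never treats an empty queue as an error; it only waits and tries again, so an idle worker that logs nothing unusual is consistent with this behavior.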

If those are both true, it sounds like the collector (which supervisorctl will list as stucco-scheduler in the dev-setup env.) is not collecting any unstructured sources.

If so, check its configuration file to make sure that it has some unstructured sources included (there are none included by default.) These entries will have the data-type set to unstructured.
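A quick way to check this is to count the data-type values in the config text. The sketch below is a plain line scan rather than a real YAML parse, and `count_data_types` is a hypothetical helper, so treat the result as a rough indication only.

```python
import re

# Matches lines such as "    data-type: unstructured".
DATA_TYPE = re.compile(r"^\s*data-type\s*:\s*(\S+)")


def count_data_types(config_text):
    """Count how often each data-type value appears in the config text."""
    counts = {}
    for line in config_text.splitlines():
        match = DATA_TYPE.match(line)
        if match:
            value = match.group(1)
            counts[value] = counts.get(value, 0) + 1
    return counts
```

For example, calling it on the contents of collectors.yml and getting no `unstructured` key back would mean the collector has nothing to put on the unstructured queue.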

After that, stop the collector, delete the /home/stucco/CollectorMetadata* files, and then restart it. (Note that if you delete those files while it is running, it will generate new ones when it exits, so you must stop the process before deleting them.)
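The order of those steps matters, so here is a minimal sketch of the sequence. It assumes the dev-setup environment, where supervisorctl manages the collector under the name stucco-scheduler; `delete_metadata` and `reset_collector` are hypothetical helpers, not part of Stucco.

```python
import glob
import os
import subprocess


def delete_metadata(home="/home/stucco"):
    """Delete CollectorMetadata* files under home and return their paths."""
    removed = []
    for path in glob.glob(os.path.join(home, "CollectorMetadata*")):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return sorted(removed)


def reset_collector(program="stucco-scheduler", home="/home/stucco"):
    """Stop the collector, delete its metadata files, then start it again."""
    # Stop first: a running collector writes new metadata files when it exits.
    subprocess.run(["supervisorctl", "stop", program], check=True)
    removed = delete_metadata(home)
    subprocess.run(["supervisorctl", "start", program], check=True)
    return removed
```

Calling `reset_collector()` would need to run as a user allowed to control supervisord and to delete files in /home/stucco.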

Let me know if that resolves things, or if my assumptions at the beginning were incorrect.

jtyoui commented 6 years ago

    Recieved: stucco.in.unstructured.Bugtraq deliveryTag=[2] message-
    e6c31bc8-1b9a-4449-8dcd-e895b48d0fb7 http://www.securityfocus.com/bid/104102/info
    bb82b2a4-4223-4ead-88d2-40d82bc9a5df http://www.securityfocus.com/bid/104102/discuss
    4e81151c-021a-47d7-871b-842db3cb3aed http://www.securityfocus.com/bid/104102/exploit
    8322da73-00bf-4b39-a4e7-abc3538cba74 http://www.securityfocus.com/bid/104102/solution
    6f6c12d7-0e92-424b-ab05-749c03230224 http://www.securityfocus.com/bid/104102/references

    Retrieving document content from Document-Service for id 'e6c31bc8-1b9a-4449-8dcd-e895b48d0fb7 http://www.securityfocus.com/bid/104102/info
    bb82b2a4-4223-4ead-88d2-40d82bc9a5df http://www.securityfocus.com/bid/104102/discuss
    4e81151c-021a-47d7-871b-842db3cb3aed http://www.securityfocus.com/bid/104102/exploit
    8322da73-00bf-4b39-a4e7-abc3538cba74 http://www.securityfocus.com/bid/104102/solution
    6f6c12d7-0e92-424b-ab05-749c03230224 http://www.securityfocus.com/bid/104102/references'.
    Could not fetch document 'e6c31bc8-1b9a-4449-8dcd-e895b48d0fb7 http://www.securityfocus.com/bid/104102/info
    bb82b2a4-4223-4ead-88d2-40d82bc9a5df http://www.securityfocus.com/bid/104102/discuss
    4e81151c-021a-47d7-871b-842db3cb3aed http://www.securityfocus.com/bid/104102/exploit
    8322da73-00bf-4b39-a4e7-abc3538cba74 http://www.securityfocus.com/bid/104102/solution
    6f6c12d7-0e92-424b-ab05-749c03230224 http://www.securityfocus.com/bid/104102/references' from Document-Service.
    gov.pnnl.stucco.doc_service_client.DocServiceException: Cannot fetch from document server
    Annotating ''...
    Annotating with heuristic cyber labels ...
    Annotating with cyber labels ...
    {"vertices": {},"edges": [] }
    {"vertices":{},"edges":[]}
    Error occurred with routingKey = stucco.in.unstructured.Bugtraq
    java.lang.NullPointerException
        at gov.ornl.stucco.unstructured.UnstructuredTransformer.run(UnstructuredTransformer.java:145)
        at gov.ornl.stucco.RunUnstructured.main(RunUnstructured.java:12)

jtyoui commented 6 years ago

    scheduler:
      collectors:
        -
          source-name: Bugtraq
          type: PSEUDO_RSS
          data-type: unstructured
          source-URI: http://www.securityfocus.com/vulnerabilities
          content-type: text/html
          entry-regex: 'href="(/bid/\d+)"'
          tab-regex: 'href="(/bid/\d+/(info|discuss|exploit|solution|references))"'
          next-page-regex: 'href="(/cgi-bin/index\.cgi\?o[^"]+)">Next &gt;<'
          cron: 0 0 23 * * ?
          now-collect: all

jtyoui commented 6 years ago

About the two points you asked me to confirm: I feel that there is no unstructured data because collectors.yml has no entries with data-type: unstructured. All of them are data-type: structured, so of course the collector only has structured data. I tried changing data-type from structured to unstructured, and the result was an error. The error output is in my previous comment.

mikeiannacone commented 6 years ago

Correct, there are no data-type:unstructured entries in the default config.

Adding one in the way you did is a reasonable way to test it, but the problem is that entry is actually collecting a group of related pages, indicated by the tab-regex item, and those are handled differently.
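To see what "a group of related pages" means here, the sketch below applies the entry-regex and tab-regex from that Bugtraq entry to a made-up HTML snippet. The snippet and the `tab_links` helper are hypothetical, and this only shows what the two patterns match, not how the collector uses them internally.

```python
import re

# Patterns copied from the Bugtraq config entry earlier in this thread.
ENTRY_REGEX = r'href="(/bid/\d+)"'
TAB_REGEX = r'href="(/bid/\d+/(info|discuss|exploit|solution|references))"'


def tab_links(html):
    """Return the tab page paths (info, discuss, ...) linked in the HTML."""
    return [match.group(1) for match in re.finditer(TAB_REGEX, html)]


# Made-up page fragment for illustration only.
SAMPLE_HTML = (
    '<a href="/bid/104102">entry</a>'
    '<a href="/bid/104102/info">info</a>'
    '<a href="/bid/104102/discuss">discuss</a>'
)

entries = re.findall(ENTRY_REGEX, SAMPLE_HTML)
tabs = tab_links(SAMPLE_HTML)
```

Here `entries` holds the single entry path and `tabs` holds its two tab pages, so one entry fans out into several documents. That is consistent with the log above, where a single message carried five id/URL pairs for bid 104102.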

Try something more like this entry:

      -
        type: RSS
        data-type: unstructured
        source-name: threatexpert
        source-URI: http://www.threatexpert.com/latest_threat_reports.aspx
        content-type : text/html
        post-process : removeHTML
        now-collect: new
        cron: 0 39 * * * ?   # collect every hour

(Also note the post-process option, as described in the collector repo readme file.)

I just tested with this, and everything seems to be working end-to-end. (However it looks like the NLP is performing poorly on that source, since it doesn't contain much 'natural' text.)