trec-dd / trec-dd-assessing

Tools for NIST's assessors working TREC DD
MIT License

Missing documents/server error in UI for Illicit #17

Open gvcormac opened 9 years ago

gvcormac commented 9 years ago

My "browse" list has the following document. When I click on it I get "server error."

com_blackhatworld_www_6dd9a64484474e6b3ed9d77e23f90ce96ae0b311_1422722123679

gvcormac commented 9 years ago

OK, now I'm totally stuck. All the docs are "server error" and I have no way to refresh the list.

[Screenshot: 2015-04-10 at 9:55:16 AM]

shawn67 commented 9 years ago

I've sent out a doc list email for illicit goods, and I just sent it again. I also checked manually: these docs are indeed not indexed on our side.

shawn67 commented 9 years ago

Refer to http://infosense.cs.georgetown.edu/resource/bhw_list

gvcormac commented 9 years ago

You are missing 60,523 documents, some from bhw, some from hackforums. Here is a list of what you are missing:

http://plg.uwaterloo.ca/~gvcormac/missing_list.txt

$ wc missing_list.txt
60523 60523 4590749 missing_list.txt

As a workaround, I will temporarily remove these from the Waterloo end.
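
A list like this can be derived by diffing the two sides' document ID lists. The following is only a minimal sketch, assuming each side can export a plain-text file with one document ID per line; the function names and the command-line file arguments are hypothetical and not part of either system.

```python
import sys


def missing_ids(rendered_ids, indexed_ids):
    """Return IDs that appear in rendered_ids but not in indexed_ids.

    Input order is preserved and duplicates are reported once.
    """
    indexed = set(indexed_ids)
    seen = set()
    missing = []
    for doc_id in rendered_ids:
        if doc_id not in indexed and doc_id not in seen:
            seen.add(doc_id)
            missing.append(doc_id)
    return missing


def read_ids(path):
    """Read one document ID per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


if __name__ == "__main__":
    # Hypothetical usage: python diff_ids.py rendered.txt indexed.txt
    if len(sys.argv) == 3:
        for doc_id in missing_ids(read_ids(sys.argv[1]), read_ids(sys.argv[2])):
            print(doc_id)
```

Piping the output to a file and running `wc` on it gives a count comparable to the one above.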

shawn67 commented 9 years ago

May I know how your side rendered the illicit goods dataset?

We only indexed the docs containing a "features" field, which we consider to be threads with real content (see my email).

So for those "missing documents", even if we add them, there would be no content inside.

gvcormac commented 9 years ago

I have no knowledge of this. We rendered every document. I don't know what you mean by "no content." The document listed above contains the following.

[ . . . ]

Default Re: ALL-IN-ONE MANUAL SUBMISSION SERVICES - Cheap and Best Service

 I am interested in this package of social bookmarking 350 PR9-PR0 -
 $35. If you can please be online on YIM that will be great. Thank
 you.
   [58]Reply With Quote Reply With Quote
     ______________________________________________________________

2.
3. 07-17-2009, 12:02 PM [59]#137
   [60]Xnode's Avatar
   [61]Xnode
   Xnode is offline Newbies
   [62]Send a message via Skype(TM) to Xnode

    Join Date
            Jun 2009

    Location
            United States

    Posts
            21

    Thanks
            20
            Thanked 6 Times in 6 Posts

Default Re: ALL-IN-ONE MANUAL SUBMISSION SERVICES - Cheap and Best Service

 I recently had my order completed and have nothing but good things
 to say. the work was done very quickly and professionally. I look
 forward to having more work done in the future. I would highly
 recommend the services to anyone. Thanks again crmarjunkarthik!
   [63]Reply With Quote Reply With Quote
     ______________________________________________________________

4.

The Following User Says Thank You to Xnode For This Useful Post:

 [64]crmarjunkarthik (07-19-2009)

gvcormac commented 9 years ago

I think you need a more robust response than "server error" for documents you don't expect (and similar "file not found" errors). When you get "server error" you lose all controls. Maybe this will be mitigated when the controls are separate and when there is a manual refresh, but I still think that some sharp edges need to be removed.
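
One way to soften this edge is to have the lookup always return something the UI can render, instead of letting an unknown document ID surface as a generic failure. This is a rough sketch only: `fetch_for_display`, the result keys, and the use of a plain mapping as the index are all assumptions for illustration, not the tool's actual code.

```python
def fetch_for_display(doc_id, index):
    """Look up doc_id and always return a dict the UI can render.

    `index` is any mapping from document ID to document text
    (a hypothetical stand-in for the real document store).
    """
    try:
        body = index[doc_id]
    except KeyError:
        # Unknown document: report it explicitly so the caller can keep
        # its controls active and offer a skip or refresh action.
        return {
            "ok": False,
            "doc_id": doc_id,
            "message": "Document not indexed; skip it or refresh the list.",
        }
    return {"ok": True, "doc_id": doc_id, "body": body}
```

With a result of this shape, the front end can show the message in place of the document while leaving the browse list and controls usable.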

shawn67 commented 9 years ago

I'll go check the documents and also take care of your suggestion above. For now, just keep those "missing" docs removed.

gvcormac commented 9 years ago

I agree that there are more important priorities than including these documents. They are currently excluded.

shawn67 commented 9 years ago

Yes. One thing I want to bring up: in the CCA schema, there are 'key', 'url', 'request', 'response', etc. fields.

For other datasets, I think the content we should show the assessor is item['response']['body']. For example, in the ebola dataset from NYU, the crawled HTML page source is inside item['response']['body'].

But for illicit goods, they added a 'features' field along with 'key', 'url', 'request', and 'response'. They merged all posts under a given thread, and the posts' contents are included in the 'features' field.
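
The two conventions could be reconciled with a fallback, so that an item without 'features' still has something to display. This is a sketch of one possible policy, not what either side currently does: `select_content` is a hypothetical helper, and it treats the value of 'features' as opaque because its internal structure is not described in this thread.

```python
def select_content(item):
    """Return (source, content) for a CCA-style item dict.

    Prefers 'features' when present and non-empty, otherwise falls back
    to item['response']['body']. Returns (None, None) if neither exists.
    """
    features = item.get("features")
    if features:
        return "features", features
    # 'response' may be missing or None, so guard before reading 'body'.
    body = (item.get("response") or {}).get("body")
    if body:
        return "response.body", body
    return None, None
```

Returning the source name alongside the content lets the caller log which field was used for each document.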

gvcormac commented 9 years ago

We are using response.body for illicit goods.