szeke closed this issue 8 years ago
Here is the document; it has no useful data.
We should filter out WebPages that have no "name" field. I am not sure how to do this, because we would also need to remove the associated offer, seller, etc.
Maybe we want to run a filter on the CDR docs first and toss out the bad web pages before we run Karma.
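A rough sketch of what such a pre-Karma filter could look like. Everything here is an assumption: it supposes the CDR docs arrive as one JSON object per line and that the page title sits in a top-level "name" field; the real CDR layout and field name may differ, and `has_name` and `filter_pages` are hypothetical helpers. If the bad pages are dropped at this stage, Karma should never generate the offer, seller, etc. for them, so nothing would need to be removed afterwards.

```python
import json


def has_name(doc, field="name"):
    """Return True if the doc has a non-empty value for `field`.

    Assumption: the title lives in a top-level field; adjust `field`
    if the CDR docs store it under a different key.
    """
    value = doc.get(field)
    if isinstance(value, str):
        return bool(value.strip())  # whitespace-only titles count as missing
    return bool(value)


def filter_pages(lines, field="name"):
    """Yield parsed docs from JSON-lines input, skipping docs without `field`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        doc = json.loads(line)
        if has_name(doc, field):
            yield doc


if __name__ == "__main__":
    sample = [
        '{"url": "http://example.com/a", "name": "Some title"}',
        '{"url": "http://example.com/b"}',
    ]
    for page in filter_pages(sample):
        print(page["url"])
```

Only the first sample page is kept, since the second has no "name".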
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "dig-latest",
        "_type": "webpage",
        "_id": "http://dig.isi.edu/ht/data/webpage/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4",
        "_score": 1,
        "_source": {
          "a": "WebPage",
          "publisher": {
            "a": "http://schema.org/Organization",
            "uri": "http://dig.isi.edu/ht/data/organization/escortads.xxx",
            "name": "escortads.xxx"
          },
          "url": "http://escortads.xxx/702-218-7871/?pid=61340175",
          "uri": "http://dig.isi.edu/ht/data/webpage/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4",
          "dateCreated": "2015-11-16T19:07:02",
          "mainEntity": {
            "a": "http://schema.org/Offer",
            "availableAtOrFrom": {
              "a": "http://schema.org/Place",
              "uri": "http://dig.isi.edu/ht/data/offer/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4/place",
              "address": [
                {
                  "a": "http://schema.org/PostalAddress",
                  "uri": "http://dig.isi.edu/ht/data/offer/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4/address"
                }
              ]
            },
            "uri": "http://dig.isi.edu/ht/data/offer/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4",
            "seller": {
              "a": "http://schema.dig.isi.edu/ontology/PersonOrOrganization",
              "uri": "http://dig.isi.edu/ht/data/seller/9DE105EC7D7CF7EB863AD3843F4C20CC56AD2800"
            },
            "itemOffered": {
              "a": "http://schema.dig.isi.edu/ontology/AdultService",
              "age": [
                "28"
              ],
              "uri": "http://dig.isi.edu/ht/data/adultservice/9DE105EC7D7CF7EB863AD3843F4C20CC56AD2800",
              "height": "175"
            },
            "validFrom": "2015-11-16T19:07:02"
          }
        }
      }
    ]
  }
}
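To spot pages like this one in a search response, something along these lines might help. It is only a sketch: it assumes the response has already been parsed into a Python dict shaped like the one above, and `hits_missing_field` is a hypothetical helper, not part of any existing script. It checks only the top-level "_source" for the field, so a nested "name" (such as the publisher's) does not count.

```python
def hits_missing_field(response, field="name"):
    """Return the _id of every hit whose _source lacks a non-empty `field`.

    Assumption: `response` is a parsed search response with the usual
    hits.hits[]._source layout, as in the example above.
    """
    hits = response.get("hits", {}).get("hits", [])
    return [
        hit["_id"]
        for hit in hits
        if not hit.get("_source", {}).get(field)
    ]


if __name__ == "__main__":
    response = {
        "hits": {
            "hits": [
                {"_id": "page-1", "_source": {"a": "WebPage", "publisher": {"name": "p"}}},
                {"_id": "page-2", "_source": {"a": "WebPage", "name": "A title"}},
            ]
        }
    }
    print(hits_missing_field(response))
```

The ids returned this way could then be reviewed by hand before anything is deleted.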
@dkapoor close issue?
I added a script that removes pages with no title as we pick them up from CDR. However, we need to run the entire workflow to test it, so shall I close the issue after that?
Closing even though we haven't tested it.
http://localhost:5009/webpage.html?value=http://dig.isi.edu/ht/data/webpage/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4&field=_id
This page has no data. Please investigate what is going on; perhaps we should delete WebPages that have no data.