usc-isi-i2 / digapp-ht

Copy of dig-polymer for new data set

Bad WebPage #2

Closed szeke closed 8 years ago

szeke commented 8 years ago

http://localhost:5009/webpage.html?value=http://dig.isi.edu/ht/data/webpage/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4&field=_id

This page has no data. Please investigate what is going on; perhaps we should delete WebPages that have no data.

szeke commented 8 years ago

Here is the document; it has no useful data.

We should filter out WebPages that have no "name" field. I wonder how to do this because we would also need to remove the offer, seller, etc.

Maybe we want to run a filter on the CDR docs first and toss out the bad web pages before we run Karma.

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
            {
                "_index": "dig-latest",
                "_type": "webpage",
                "_id": "http://dig.isi.edu/ht/data/webpage/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4",
                "_score": 1,
                "_source": {
                    "a": "WebPage",
                    "publisher": {
                        "a": "http://schema.org/Organization",
                        "uri": "http://dig.isi.edu/ht/data/organization/escortads.xxx",
                        "name": "escortads.xxx"
                    },
                    "url": "http://escortads.xxx/702-218-7871/?pid=61340175",
                    "uri": "http://dig.isi.edu/ht/data/webpage/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4",
                    "dateCreated": "2015-11-16T19:07:02",
                    "mainEntity": {
                        "a": "http://schema.org/Offer",
                        "availableAtOrFrom": {
                            "a": "http://schema.org/Place",
                            "uri": "http://dig.isi.edu/ht/data/offer/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4/place",
                            "address": [
                                {
                                    "a": "http://schema.org/PostalAddress",
                                    "uri": "http://dig.isi.edu/ht/data/offer/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4/address"
                                }
                            ]
                        },
                        "uri": "http://dig.isi.edu/ht/data/offer/9CFE163AD83D8941F15B47B732C60423069DF0C5D05C6D83663CE395881582E4",
                        "seller": {
                            "a": "http://schema.dig.isi.edu/ontology/PersonOrOrganization",
                            "uri": "http://dig.isi.edu/ht/data/seller/9DE105EC7D7CF7EB863AD3843F4C20CC56AD2800"
                        },
                        "itemOffered": {
                            "a": "http://schema.dig.isi.edu/ontology/AdultService",
                            "age": [
                                "28"
                            ],
                            "uri": "http://dig.isi.edu/ht/data/adultservice/9DE105EC7D7CF7EB863AD3843F4C20CC56AD2800",
                            "height": "175"
                        },
                        "validFrom": "2015-11-16T19:07:02"
                    }
                }
            }
        ]
    }
}
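
To see how many WebPages are affected, a search for documents without a "name" field could look roughly like the sketch below. It only builds the request body and does not talk to a cluster. It assumes an Elasticsearch version where `exists` can be used inside a `bool` query (2.0 or later); the helper names `missing_field_query` and `lacks_field` are made up for illustration and are not part of this repo.

```python
import json


def missing_field_query(field="name", size=10):
    """Build a search body matching documents that lack `field`.

    Assumption: the cluster accepts `exists` inside a `bool` query
    (Elasticsearch 2.0+). Older versions need a different filter.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                "must_not": [{"exists": {"field": field}}]
            }
        },
    }


def lacks_field(hit, field="name"):
    """Client-side check of one search hit.

    Slightly stricter than the query above: an empty string also
    counts as missing here.
    """
    value = hit.get("_source", {}).get(field)
    return value is None or value == "" or value == []


if __name__ == "__main__":
    # Print the body so it can be pasted into a _search request.
    print(json.dumps(missing_field_query(), indent=2))
```

The hit shown above would be flagged by `lacks_field`, since its `_source` has no top-level "name". Because the offer, seller, and place are nested under `mainEntity` in this document, dropping the WebPage document drops them too; whether they also exist as separate documents elsewhere is not something this sketch checks.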

szeke commented 8 years ago

@dkapoor close issue?

dkapoor commented 8 years ago

I added the script to remove the pages with no title as we pick it up from CDR. However, we need to run the entire workflow to test it, so shall I close the issue after that?
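
For reference, a filter of this kind could be sketched as below. This is not the script that was added, only an illustration under assumptions: CDR documents arrive as one JSON object per line, and the page title lives in a top-level field called `title` (a guess; the real CDR field name may differ). The names `has_title` and `filter_cdr_lines` are hypothetical.

```python
import json


def has_title(doc, field="title"):
    """True if the document has a non-blank string in `field`.

    `field` defaults to "title", which is an assumed CDR field name.
    """
    value = doc.get(field)
    return isinstance(value, str) and value.strip() != ""


def filter_cdr_lines(lines, field="title"):
    """Yield parsed documents that have a title, skipping the rest.

    Assumes JSON-lines input: one CDR document per line.
    Blank lines are ignored.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        if has_title(doc, field):
            yield doc


if __name__ == "__main__":
    sample = [
        '{"url": "http://example.invalid/a", "title": "An ad"}',
        '{"url": "http://example.invalid/b"}',
        '{"url": "http://example.invalid/c", "title": "   "}',
    ]
    for doc in filter_cdr_lines(sample):
        print(doc["url"])
```

A function like this can be exercised on a handful of sample documents on its own, which gives some confidence in the filter before the entire workflow is run.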

szeke commented 8 years ago

Closing even though we haven't tested it.