mfhepp / elmar-to-goodrelations

Automatically exported from code.google.com/p/elmar-to-goodrelations

Store HTTP Response Header data in RDF for each graph #1

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Please store the data from the HTTP response to the request for the CSV file
as metadata in the RDF graph for each store, using

http://www.w3.org/TR/HTTP-in-RDF/

e.g. status code, size, media type, cache info, charset/encoding.

This is important for later analysis of the data.

Original issue reported on code.google.com by mfhepp on 17 Jan 2011 at 2:18

GoogleCodeExporter commented 9 years ago
The URI above is wrong: it refers to an outdated Editor's Draft of the
HTTP-in-RDF vocabulary.

The latest Public Working Draft is here:
http://www.w3.org/TR/HTTP-in-RDF10/

Original comment by mfhepp on 17 Jan 2011 at 9:08

GoogleCodeExporter commented 9 years ago
Another approach: use the Provenance Vocabulary (thanks to Olaf Hartig):

<snip>

What you describe seems to be exactly one of the use cases we developed the
Provenance Vocabulary [1] for: The Provenance Vocabulary provides the class
prv:DataAccess which represents the execution of a data access on the Web.
Using the property prvTypes:exchangedHTTPMessage you can associate instances
of prv:DataAccess with the HTTP messages that have been exchanged. These
HTTP messages can then be described using the W3C RDF vocabulary for HTTP.
Here's an example:

  foo:DataAboutProduct1
            foaf:primaryTopic foo:Product1 ;
            prv:createdBy _:dc .

  _:dc
            a prv:DataCreation ;
            # ... additional information about the creation process ...
            prv:usedData _:xml .

  _:xml
            a prv:DataItem ;
            prv:retrievedBy _:da .

  _:da
            a prv:DataAccess ;
            prv:accessedResource <http://www.heppnetz.de/companies.xml> ;
            prvTypes:exchangedHTTPMessage _:m .

  _:m
            a http:Response ;
            http:httpVersion "1.1" ;
            # ...
            http:statusCodeNumber "200" .

(Needless to say that you may use URIs instead of the blank node identifiers 
that I used in the example for the sake of readability.)

Our "Guide to the Provenance Vocabulary" contains another example in Section
"3.3.2 Related Vocabularies: HTTP Vocabulary in RDF" [2].

Greetings,
Olaf

[1] http://purl.org/net/provenance/
[2] http://purl.org/net/provenance/guide#HTTP_Vocabulary_in_RDF

</snip>

Original comment by mfhepp on 18 Jan 2011 at 10:08

GoogleCodeExporter commented 9 years ago
Basically, the code must be extended at line 400 of mainloops.py:

        csv.register_dialect("short_life", delimiter=self.updateM.delimiter,quotechar=self.updateM.quoted,escapechar=self.updateM.escaped)
        dat2 = urllib.urlopen(datei, timeout=self.paramenter.timeout)
        reader = csv.reader(dat2, "short_life")

You may have to use urllib2 instead of urllib to access the HTTP headers; good
documentation is here:

   http://www.voidspace.org.uk/python/articles/urllib2.shtml

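import urllib2

# url is the address of the CSV file to fetch
user_agent = 'Elmar2GoodRelations'
request_headers = {'User-Agent': user_agent}
# Pass the headers by keyword: the second positional argument of Request is the POST data
req = urllib2.Request(url, headers=request_headers)
response = urllib2.urlopen(req)
the_page = response.read()
# Raw response header lines, one string per header
headers = response.info().headers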

headers will then be a list with the header info:

['Date: Tue, 18 Jan 2011 11:32:01 GMT\r\n', 'Server: Apache\r\n', 
'Last-Modified: Sat, 27 Nov 2010 19:51:44 GMT\r\n', 'ETag: 
"193a1f07-4165-4cf16150"\r\n', 'Accept-Ranges: bytes\r\n', 'Content-Length: 
16741\r\n', 'Connection: close\r\n', 'Content-Type: text/html\r\n']

but you still have to split it into field name and field value.
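
The splitting step could look roughly like the sketch below. The function name split_header_lines is made up for illustration and is not part of the existing code; it only assumes a list of raw 'Name: value' strings like the one shown above, and it does not handle folded (multi-line) header values.

```python
def split_header_lines(raw_lines):
    """Split raw 'Name: value\\r\\n' header lines into (name, value) pairs.

    Hypothetical helper for illustration; not part of mainloops.py.
    """
    pairs = []
    for line in raw_lines:
        # Split at the first colon only: values such as dates contain colons too.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip anything that is not a 'Name: value' line
        pairs.append((name.strip(), value.strip()))
    return pairs


raw = [
    'Date: Tue, 18 Jan 2011 11:32:01 GMT\r\n',
    'Content-Length: 16741\r\n',
    'Content-Type: text/html\r\n',
]
pairs = split_header_lines(raw)
```

The result is a list of pairs rather than a dict, so that repeated header fields are not silently overwritten.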

If you know the header name, you can also access its value directly:

content_type = response.info().getheader('Content-Type')
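
To get from there to the RDF, one option is to write the status code and the header fields out as Turtle, along the lines of Olaf's example. This is only a sketch: the function names are invented for illustration, http:Response and http:statusCodeNumber are taken from the example above, and http:headers, http:fieldName and http:fieldValue are assumptions that should be checked against the HTTP-in-RDF draft before use. Prefix declarations are left out, as in the example.

```python
def turtle_literal(text):
    """Quote a string as a Turtle literal, escaping backslashes and double quotes."""
    return '"%s"' % text.replace('\\', '\\\\').replace('"', '\\"')


def response_to_turtle(status_code, header_pairs, node="_:m"):
    """Render one HTTP response as a Turtle snippet (hypothetical helper).

    header_pairs is a list of (field name, field value) string tuples.
    The http:headers / http:fieldName / http:fieldValue terms are assumed,
    not confirmed by this thread.
    """
    lines = [node, "    a http:Response ;"]
    for name, value in header_pairs:
        lines.append(
            "    http:headers [ http:fieldName %s ; http:fieldValue %s ] ;"
            % (turtle_literal(name), turtle_literal(value))
        )
    # The status code goes last so the statement can end with a full stop.
    lines.append("    http:statusCodeNumber %s ." % turtle_literal(str(status_code)))
    return "\n".join(lines)


example = response_to_turtle(
    200,
    [("Content-Type", "text/html"), ("ETag", '"193a1f07-4165-4cf16150"')],
)
```

The blank node _:m can then be linked from the prv:DataAccess node via prvTypes:exchangedHTTPMessage, as in the example above.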

Original comment by mfhepp on 18 Jan 2011 at 11:36