ualbertalib / HydraNorth

This repo is deprecated. Succeeded by https://github.com/ualbertalib/jupiter. This codebase was an IR built on Samvera/Sufia.

Sitemap error due to broken items #1195

Closed pbinkley closed 8 years ago

pbinkley commented 8 years ago

Add error handling to sitemap generation so that broken items are logged for manual repair.

pgwillia commented 8 years ago

To reproduce, I got uuid:7a385f26-6387-4f4e-9ba6-152044731c04.xml migrated with v1.2.1.

The output was:

ERROR [line: 34] With input '"Smith, John B. \"Web-Based Systems and Instruction.\" Web. < http://www.cs.unc.edu/Research/jbsAr': Invalid token "\"Smith," (found "\"Smith,"), production = :RDFLiteral
ERROR [line: 34] Unexpected (found "http:"(PNAME_NS)), production = "."
ERROR [line: 35] Unexpected (found "A"), production = ")"
ERROR [line: 35] undefined prefix "http"
ERROR [line: 35] With input '//www.cs.unc.edu/Research/jbsArchive/docs/AutoGeneratedSystems/ > Accessed 31 March, 2015.
    Smith,': Invalid token "//www.cs.unc.edu/Research/jbsArchive/docs/AutoGeneratedSystems/" (found "//www.cs.unc.edu/Research/jbsArchive/docs/AutoGeneratedSystems/"), production = :predicateObjectList
ERROR [line: 36] With input 'Smith, John B. and Catherine F. Smith. \"ChicoryLane Farm.\" Website. < http://www.chicorylane.com': Invalid token "Smith," (found "Smith,"), production = :_turtleDoc_1
ERROR [line: 34] With input '"Smith, John B. \"Web-Based Systems and Instruction.\" Web. < http://www.cs.unc.edu/Research/jbsAr': Invalid token "\"Smith," (found "\"Smith,"), production = :RDFLiteral
ERROR [line: 34] Unexpected (found "http:"(PNAME_NS)), production = "."
ERROR [line: 35] Unexpected (found "A"), production = ")"
ERROR [line: 35] undefined prefix "http"
ERROR [line: 35] With input '//www.cs.unc.edu/Research/jbsArchive/docs/AutoGeneratedSystems/ > Accessed 31 March, 2015.
    Smith,': Invalid token "//www.cs.unc.edu/Research/jbsArchive/docs/AutoGeneratedSystems/" (found "//www.cs.unc.edu/Research/jbsArchive/docs/AutoGeneratedSystems/"), production = :predicateObjectList
ERROR [line: 36] With input 'Smith, John B. and Catherine F. Smith. \"ChicoryLane Farm.\" Website. < http://www.chicorylane.com': Invalid token "Smith," (found "Smith,"), production = :_turtleDoc_1
Save file used 9.799644534

It appears the parser took exception to the http://www.cs.unc.edu/Research/jbsArchive/docs/AutoGeneratedSystems/ URL that ends up in the description.

pgwillia commented 8 years ago

@weiweishi what do we want the error message to say? I can include the id.

Maybe something like: "There was a problem with #{o['id']} and it was not included in the sitemap.xml"

weiweishi commented 8 years ago

Maybe we can have something like #{o['id']}: ERROR to be included in sitemap.xml, so the id can be parsed out more easily? Can it catch the "Invalid Token" error message? If we can have that information there, it would be great.

#{o['id']}: ERROR "Invalid Token", failed to be included in sitemap.xml


pgwillia commented 8 years ago

I'm using Rails.logger.error, so the ERROR would be redundant, I think. It'll appear in the production logs as something like:

E, [2016-06-19T04:43:20.439294 #13017] ERROR -- : There was a problem with 9593tv13c and it was not included in the sitemap.xml -

It looks like the error I'm actually capturing is <ActiveFedora::ActiveFedoraError: Model mismatch. Expected GenericFile. Got: ActiveFedora::Base>, which doesn't carry any of the 'invalid token' information that is printed to the screen.
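
For illustration, a minimal sketch of the rescue-and-log approach discussed above, with the id placed first so it is easy to parse out of the log. This is not the actual HydraNorth sitemap task: `sitemap_entries`, `BrokenItemError`, and the example.org URL are hypothetical, and a plain `Logger` stands in for `Rails.logger`.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for the error raised while loading a broken object.
class BrokenItemError < StandardError; end

# Builds sitemap entries, skipping and logging any item that fails.
# `items` is an array of hashes with an 'id' key; the block stands in for
# whatever per-item work the real sitemap task performs (assumption).
def sitemap_entries(items, logger)
  items.each_with_object([]) do |o, entries|
    begin
      entries << yield(o)
    rescue StandardError => e
      # The id comes first so it can be parsed out of the log easily.
      logger.error("#{o['id']}: not included in the sitemap.xml (#{e.class}: #{e.message})")
    end
  end
end

log = StringIO.new
logger = Logger.new(log) # stands in for Rails.logger
items = [{ 'id' => 'good1' }, { 'id' => '9593tv13c' }]

entries = sitemap_entries(items, logger) do |o|
  raise BrokenItemError, 'Model mismatch' if o['id'] == '9593tv13c'
  "<url><loc>https://example.org/files/#{o['id']}</loc></url>"
end
```

With this shape one broken item no longer aborts the whole run, and the logged line carries whatever message the rescued exception has. As noted above, that is the model-mismatch message rather than the parser's 'invalid token' text.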

pgwillia commented 8 years ago

That's because the parse error is handled in rdf-turtle-1.1.7/lib/rdf/turtle/reader.rb:

  # @option options [Boolean]  :validate     (false)
  #   whether to validate the parsed statements and values. If not validating,
  #   the parser will attempt to recover from errors.
...
rescue EBNF::LL1::Parser::Error, EBNF::LL1::Lexer::Error =>  e
  if validate?
    raise RDF::ReaderError.new(e.message, lineno: e.lineno, token: e.token)
  else
    $stderr.puts e.message
  end
end

I don't think I can influence the validate? outcome. The :validate option would have to be passed at

  RDF::Reader.for(:ttl).new(StringIO.new(body), :base_uri => page_subject) do |reader|

[/usr/lib64/ruby/gems/2.1.0/gems/ldp-0.4.0/lib/ldp/response.rb:130] but it's not :(
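
One possible workaround, sketched under assumptions: because the non-validating branch writes to the `$stderr` global, temporarily swapping `$stderr` for a `StringIO` around the load would let the parser's message be captured and logged. `capture_stderr` is a hypothetical helper, the block body below only imitates the parser's output, and the swap is process-wide, so it is not thread-safe.

```ruby
require 'stringio'

# Runs the block with $stderr temporarily redirected to a buffer and
# returns [block_result, captured_text]. Hypothetical helper; it swaps
# the process-wide $stderr global, so it is not thread-safe.
def capture_stderr
  original = $stderr
  buffer = StringIO.new
  $stderr = buffer
  result = yield
  [result, buffer.string]
ensure
  # Always restore the original stream, even if the block raises.
  $stderr = original
end

# Stand-in for the parser's non-validating branch, which only prints.
result, warnings = capture_stderr do
  $stderr.puts 'ERROR [line: 35] undefined prefix "http"'
  :parsed
end
```

The captured text in `warnings` could then be appended to the logged line for the item being processed. Whether this is acceptable inside the sitemap task depends on what else writes to `$stderr` during the load.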

weiweishi commented 8 years ago

Interesting. If we can capture the actual error message in the log it would be great. Otherwise, I'm fine with not having ERROR in the line, but it would be helpful to have ids up front. I will leave the rest of the language to you. This is going to help us a lot in going through existing objects, as it was not easy to detect them through audit.


pgwillia commented 8 years ago

On era-test it still fails after ~24 hours. This time it seems to be related to the file operations that create the sitemap.xml and its fragments. Created #1249.