Text/html - Githubissues

thyrymn commented 9 years ago

When I try to clone a site, with either the stable or the unstable branch, I get an error saying that the type "text/html" is not supported.

When I try to use the --wget option, it doesn't work, because to mirror the site with wget I need the robots=off setting so that wget ignores robots.txt. The --wget option ignores the .wgetrc file.

Any ideas?

thyrymn commented 9 years ago

Saving WebServer to [ /root/.cartero/templates/webserver ]
Cloning URL https://www.blahblah.com
unknown encoding name - text/html
/root/Cartero/lib/cartero/commands/cloner.rb:220:in `force_encoding'
/root/Cartero/lib/cartero/commands/cloner.rb:220:in `create_index'
/root/Cartero/lib/cartero/commands/cloner.rb:146:in `clone'
/root/Cartero/lib/cartero/commands/cloner.rb:108:in `run'
/root/Cartero/lib/cartero/command.rb:82:in `block in method_added'
/root/Cartero/lib/cartero/cli.rb:190:in `block in run'
/root/Cartero/lib/cartero/cli.rb:184:in `each'
/root/Cartero/lib/cartero/cli.rb:184:in `run'
./cartero:52:in `<main>'

mrbrutti commented 9 years ago

OK, now I get the problem. I'll fix it ASAP and push a commit. I'm working on a better implementation of how I get the content_type for a specific site.
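For readers following along: the "unknown encoding name - text/html" message in the trace above is what Ruby raises when a media type is passed to force_encoding where a charset name is expected. The sketch below shows one way to pull the charset out of a Content-Type value before decoding. It is a minimal illustration, not Cartero's actual code; the helper name charset_from_content_type, the UTF-8 fallback, and the sample header are all assumptions.

```ruby
# Hypothetical helper (not from Cartero): extract a usable charset name
# from a Content-Type header value, falling back to a default.
def charset_from_content_type(content_type, default = "UTF-8")
  return default if content_type.nil?

  # Look for a "charset=..." parameter; the media type itself
  # (for example "text/html") is never a valid encoding name.
  match = content_type.match(/charset\s*=\s*"?([^\s;"]+)"?/i)
  name = match ? match[1] : default

  # Encoding.find raises ArgumentError for names Ruby does not know.
  Encoding.find(name).name
rescue ArgumentError
  default
end

# Assumed sample data: raw bytes as they might arrive from an HTTP body.
body = "caf\xC3\xA9".b
header = "text/html; charset=utf-8"

# Tag the bytes with the charset, not with the whole header value.
decoded = body.force_encoding(charset_from_content_type(header))
```

Falling back to a default instead of raising keeps a clone running when a server sends no charset or an unrecognized one, at the cost of possibly mis-decoding that page.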

thyrymn commented 9 years ago

Want another one? Different site:

Cloning URL https://www.blahblah.com
bad URI(is not URI?): Email: myname at domain dot com
/usr/local/rvm/rubies/ruby-2.2.1/lib/ruby/2.2.0/uri/rfc3986_parser.rb:66:in `split'
/usr/local/rvm/rubies/ruby-2.2.1/lib/ruby/2.2.0/uri/rfc3986_parser.rb:72:in `parse'
/usr/local/rvm/rubies/ruby-2.2.1/lib/ruby/2.2.0/uri/common.rb:226:in `parse'
/root/Cartero/lib/cartero/commands/cloner.rb:177:in `block in proccess_urls'
/usr/local/rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.6.2/lib/nokogiri/xml/node_set.rb:187:in `block in each'
/usr/local/rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.6.2/lib/nokogiri/xml/node_set.rb:186:in `upto'
/usr/local/rvm/gems/ruby-2.2.1/gems/nokogiri-1.6.6.2/lib/nokogiri/xml/node_set.rb:186:in `each'
/root/Cartero/lib/cartero/commands/cloner.rb:176:in `proccess_urls'
/root/Cartero/lib/cartero/commands/cloner.rb:194:in `create_index'
/root/Cartero/lib/cartero/commands/cloner.rb:146:in `clone'
/root/Cartero/lib/cartero/commands/cloner.rb:108:in `run'
/root/Cartero/lib/cartero/command.rb:82:in `block in method_added'
/root/Cartero/lib/cartero/cli.rb:190:in `block in run'
/root/Cartero/lib/cartero/cli.rb:184:in `each'
/root/Cartero/lib/cartero/cli.rb:184:in `run'
./cartero:52:in `<main>'

mrbrutti commented 9 years ago

@thyrymn can you please test it using the latest available commit? Let me know if that fixes it.

mrbrutti commented 9 years ago

As for the latest comment, are you trying to clone an email?

thyrymn commented 9 years ago

No, it is my personal domain. No email. Testing now.

mrbrutti commented 9 years ago

I asked because of the "bad URI(is not URI?): Email: myname at domain dot com" message. It looks like an error in one of the underlying gems. Could you check the source of that page for a URL like email:test@test.com instead of mailto:test@test.com? It is complaining and breaking on something the RFC parser can't handle, and if I can fix it I will happily do so. Cloner needs to handle a crazy number of things, so the more sites we clone, the more issues I can fix as they show up.

thyrymn commented 9 years ago

I was at your thotcon talk, fwiw.

Issue 1: The site sort of mirrors. All text, no pictures. It looks like a gopher page. The error is gone.

Issue 2: It is my resume. The thing it is hitting is an href:

<div id="contactDetails" class="quickFade delayFour">
                        <ul>
<a href="Email: myname at domain dot com" target="_blank">myname at domain dot com</a></li>

mrbrutti commented 9 years ago

Weird. It should be cloning the website without any issues. The commands should look like this:

cartero Cloner -U https://www.gmail.com -p /tmp  -W gmail
cartero Listener -W /tmp/gmail -p 9090

mrbrutti commented 9 years ago

I will do some research into the second one. That is an interesting issue. I guess I am being too smart in my builder and trying to edit links too much. I can try to ignore these links when they fail to parse and leave them as they are.
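The "leave them as they are" approach described here can be sketched with Ruby's standard uri library alone. This is only an illustration under stated assumptions, not the code that ended up in Cartero: the helper name absolutize_href and the example.com base URL are hypothetical.

```ruby
require "uri"

# Hypothetical helper (not from Cartero): resolve an href against the
# page's base URL, but return the href unchanged when it is not a
# parsable URI instead of aborting the whole clone.
def absolutize_href(href, base)
  URI.join(base, href).to_s
rescue URI::InvalidURIError
  # Hrefs such as "Email: myname at domain dot com" contain spaces and
  # cannot be parsed, so they are passed through untouched.
  href
end

base = "https://www.example.com/" # assumed base URL for illustration

links = [
  "/about.html",                      # relative link, gets resolved
  "mailto:test@test.com",             # already absolute, kept as-is
  "Email: myname at domain dot com"   # invalid, kept as-is
]

links.each { |href| puts absolutize_href(href, base) }
```

Rescuing per link means one malformed href costs at most one unrewritten link, rather than stopping the run as in the trace earlier in the thread.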

thyrymn commented 9 years ago

Yeah, I've cloned a bunch of sites that work right. I'm trying to find one that behaves like my first one.

thyrymn commented 9 years ago

I moved issue #2 to www.spiritualdictator.com so you can try it.

thyrymn commented 9 years ago

It looks like sites that have heavy use of javascript have problem #1.

mrbrutti commented 9 years ago

Thanks. Weird about the JavaScript. Issue number 1 was just an encoding issue. I am now allowing the underlying gem to determine the encoding. It should not be related to JavaScript, but then again the internet is a weird world. If you have examples of sites that do not work, I will gladly add them to my testing and try to find the root of the issue.

In any case, I am still working on issue #2, but thanks for the feedback. The tool only gets better when people use it and report back, so I can make it even more awesome. I really appreciate it.

mrbrutti commented 9 years ago

OK, yet another commit. Please check. I used your site and I got it to render correctly. Also, I am no longer coughing up errors for unknown URIs; I am leaving them as they are. This is not perfect, and there is still a known case in which it might not work, but as long as you are using ruby > 2.2.0 you should be OK. I'll eventually be moving to it.

mrbrutti commented 9 years ago

Feel free to reopen the issue or create a new one if you find anything else.

thyrymn commented 9 years ago

Issue 1 appears fully fixed in the release this morning.

mrbrutti / Cartero

Text/html #1