mikavilpas / link_title

Irssi plugin for manipulating links sent by others

amazon.com page identified as 'ascii c program text...' and no title found #7

Open mikavilpas opened 14 years ago

mikavilpas commented 14 years ago

What steps will reproduce the problem?

  1. link to http://www.amazon.com/Stainless-Steel-Gingerbread-Clip-Earrings/dp/B0030Z0SXQ/ref=sr_1_72?ie=UTF8&s=jewelry&qid=1275928286&sr=1-72

What is the expected output? What do you see instead?

19:38 -!- File: ASCII C program text, with very long lines

The page has a title, but it is not found.
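For illustration only: a minimal sketch, in Python rather than the plugin's Perl, of deciding whether a response is HTML from its declared Content-Type before falling back to looking at the content. Whether the plugin works this way is an assumption; `looks_like_html` and its parameters are hypothetical names, not part of link_title.

```python
def looks_like_html(content_type, body_start):
    """Return True if the response should be treated as an HTML page.

    content_type: value of the HTTP Content-Type header, or None.
    body_start:   the first characters of the response body (str).
    """
    # Strip parameters such as "; charset=UTF-8" and normalize case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return True
    if media_type:
        # The server declared some other type; trust it.
        return False
    # No declared type: fall back to a crude look at the start of the body.
    head = body_start.lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")
```

With a check like this, a page served as `text/html` would be searched for a title even if its first lines look like C or CSS to a content sniffer.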

mikavilpas commented 14 years ago

Currently researching, findings:

The page is very large. It contains lots of empty lines, an inline stylesheet (hundreds of lines), and quite a bit of JavaScript. The size of the file is 228 KB. The title is found on line number 3076 (!).

However, the test script (non-Irssi) was able to get the title, so Irssi should too.
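To show why the position of the title matters, here is a minimal sketch in Python (not the plugin's Perl, and not the test script mentioned above) of a title extractor that is fed the page in chunks and stops once the closing title tag is seen. Anything that only looks at the first chunk of such a page would miss a title that sits thousands of lines in. `TitleParser` and `extract_title` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element it sees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Title text may arrive in several pieces when fed in chunks.
        if self.in_title:
            self.parts.append(data)

    @property
    def title(self):
        if not self.done:
            return None
        # Collapse runs of whitespace and newlines into single spaces.
        return " ".join("".join(self.parts).split())


def extract_title(chunks):
    """Feed chunks of HTML text until a complete <title> has been read."""
    parser = TitleParser()
    for chunk in chunks:
        parser.feed(chunk)
        if parser.done:
            break  # no need to read the rest of a 228 KB page
    return parser.title
```

Usage: split the downloaded page into pieces (for example 4096 characters each) and pass them to `extract_title`; it returns `None` if no complete title was seen in the chunks it was given.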

mikavilpas commented 14 years ago

Here's a useful find, might try this and see if it magically solves all problems: http://www.wdvl.com/Authoring/Languages/Perl/PerlfortheWeb/summarizer.html

mikavilpas commented 14 years ago

Tried HTML::TokeParser. In my test script, which does not run inside Irssi, it got the title from the link in this issue. Inside Irssi, however, it failed to find the title.

mikavilpas commented 14 years ago

Came across this module, which is designed to get the title from web pages as well as images, mp3s and pdf files. Will have to research and perhaps redesign the whole method for getting the title.

http://search.cpan.org/~tomi/URI-Title-1.82/lib/URI/Title.pm

mikavilpas commented 14 years ago

That module is not working out for me. I'm not going to develop support for it at this point. The idea about getting titles from pdf files, images and mp3 files is a very good one.

mikavilpas commented 14 years ago

My test script gets the title perfectly fine. However, the exact same block of code does not when run inside Irssi.

Trying different options for HTML::TokeParser to see if they get the desired effect.