sindresorhus / article-title

Extract the article title of a HTML document
MIT License
53 stars 7 forks source link

Title with colon #1

Open hemanth opened 9 years ago

hemanth commented 9 years ago
curl 'https://www.youtube.com/watch?v=ci0Pihl2zXY' -s | article-title

Results in:

#Edit2014

Commenting out this RE in the code, gives the right result that being Wikipedia: #Edit2014

//cc @sindresorhus

hemanth commented 9 years ago

@sindresorhus Meow :) ?

sindresorhus commented 9 years ago

Don't have time to look into this right now, but looks like I need to give higher priority to the last part of the title.

PR welcome in form of a fix or even just a failing test-case.

hemanth commented 9 years ago

:+1:

icyflame commented 9 years ago

@hemanth

I have done some work on this.

I have basically added two test cases.

One of them uses the node-curl package, and curls the page from Youtube and then finds the title. This returns an appropriate response. (Wikipedia: Edit14)

The second test is using a fixture. I have stored the HTML in a new fixture file, and then run the same test using a fixture. Surprisingly, this returns the result #Edit14 )

I am not able to figure out why the two differing results are found.

Find my new tests at: https://github.com/icyflame/article-title/blob/add-tests-for-titles-with-colon/test.js#L23

Output of the new test:

image

The one test that passes is the one which curls the youtube HTML.

cc: @sindresorhus

astoilkov commented 5 years ago

I am seeing a lot of incorrect results when using article-title on YouTube. Example:

I think two things can be done the docTitle stripping could be improved. Second, a new logic could be added where if you have only one h1 tag on the page this tag is considered the title.

astoilkov commented 5 years ago

@sindresorhus If you agree with the new logic regarding only one h1 on the page I can implement it and send a pull request.

sindresorhus commented 5 years ago

Yup. I agree. PR welcome :)