o0111 / ruralcafe

Automatically exported from code.google.com/p/ruralcafe

Implementation of IsATextPage is incomplete #23

GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
This method is used to decide whether to add a page to a package based on the 
richness filtering rules.

Original issue reported on code.google.com by shouldab...@gmail.com on 10 Oct 2010 at 8:42

GoogleCodeExporter commented 8 years ago
I locally re-implemented the method so that it sends a HEAD request to 
determine the content type. Is even that too much communication?

I can't test it, because TroTro's behaviour doesn't change when I select "only 
download text content". I assume this is the option for low richness. Is there 
an explanation for that behaviour?

Original comment by satiaher...@gmx.de on 11 Apr 2013 at 8:36

GoogleCodeExporter commented 8 years ago
This is probably a TroTro bug. You're right that it should use this method to 
download only text content as appropriate.

Original comment by shouldab...@gmail.com on 18 Apr 2013 at 11:51

GoogleCodeExporter commented 8 years ago
This still needs to be tested with TroTro, and logging is missing.

Original comment by satiaher...@gmx.de on 21 Apr 2013 at 10:35

GoogleCodeExporter commented 8 years ago

Original comment by satiaher...@gmx.de on 1 May 2013 at 10:54

GoogleCodeExporter commented 8 years ago
I combined the old and the new implementation. I also fixed some bugs in the 
methods that determine the file extension in Util.cs.

It now first tries to guess the content type from the file extension. If that 
fails, it sends a HEAD request with a timeout of 1 s. I moved the timeout value 
to the other timeout constants, which apparently are not configurable either.
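
For readers following along, the two-step check described above could be sketched roughly as follows. This is an illustrative Python sketch, not the C# code in Util.cs: the helper names, the `head` parameter, and the rule that a "text page" is any page whose MIME type starts with `text/` are all assumptions made for the example.

```python
import mimetypes
import urllib.request
from urllib.parse import urlparse

# Assumed constant name; the 1 s value comes from the comment above.
HEAD_TIMEOUT_SECONDS = 1.0


def guess_type_from_extension(url):
    """Return a MIME type guessed from the URL path's extension, or None."""
    path = urlparse(url).path
    mime, _encoding = mimetypes.guess_type(path)
    return mime


def head_content_type(url, timeout=HEAD_TIMEOUT_SECONDS):
    """Send a HEAD request; return the Content-Type without parameters, or None."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            value = response.headers.get("Content-Type")
    except (OSError, ValueError):
        # Covers timeouts, connection errors, HTTP errors and malformed URLs.
        return None
    if value is None:
        return None
    # Drop parameters such as "; charset=utf-8".
    return value.split(";")[0].strip().lower()


def is_a_text_page(url, head=head_content_type):
    """Guess from the extension first; fall back to a HEAD request if that fails."""
    mime = guess_type_from_extension(url)
    if mime is None:
        mime = head(url)
    # Assumption: only text/* counts as a text page; unknown types do not.
    return mime is not None and mime.startswith("text/")
```

Passing a stub as `head` (for example `lambda url: "text/html"`) makes it possible to exercise the fallback path without any network traffic, which also keeps the extension-only path free of extra communication.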

I renamed the method to IsATextPage.

I tested it with some pages and it seems to work fine.

Original comment by satiaher...@gmx.de on 1 May 2013 at 2:41