rodricios / eatiht

An exercise in unsupervised machine learning: Extract Article's Text in HTml documents.
http://rodricios.github.io/eatiht
MIT License
434 stars 43 forks source link

Extractor seems to include a lot of js and in-line code? #15

Open AKAMEDIASYSTEM opened 9 years ago

AKAMEDIASYSTEM commented 9 years ago

Really enjoying eatiht, it's almost perfect for me.

I'm using it to process urls and send the text to a topic modeler. The output is admirably clean except for in-line code (which seems like it might be easy to detect?).

Here's a sample output that includes well-extracted text and then code:

In the best of worlds, we find an exact match d_{t'} and we use
d_{t' + delta} (suppose that t' + delta  
In practice, of course, prediction is much more complicated.  See
    [Sauer, 1994]  in 
  [Gershenfeld and
Weigend, 1993])  and the discussion in 

DelayCoordinateEmbedding.m  for more details.

mw.loader.implement("ext.vector.collapsibleNav",function($){(function($){var map={'ltr':{'opera':[['>=',9.6]],'konqueror':[['>=',4.0]],'blackberry':false,'ipod':false,'iphone':false,'ps3':false},'rtl':{'opera':[['>=',9.6]],'konqueror':[['>=',4.0]],'blackberry':false,'ipod':false,'iphone':false,'ps3':false}};if(!$.client.test(map)){return true;}var version=1;if(mediaWiki.config.get('wgCollapsibleNavForceNewVersion')){version=2;}else{if(mediaWiki.config.get('wgCollapsibleNavBucketTest')){version=$.cookie('vector-nav-pref-version');if(version==null){version=Math.round(Math.random()+1);$.cookie('vector-nav-pref-version',version,{'expires':30,'path':'/'});}}}if(version==2){var limit=5;var threshold=3;$('#p-lang ul').addClass('secondary').before('

');$('#p-lang-more h5').text(mw.usability.getMsg('vector-collapsiblenav-more'));$secondary.appendTo($('#p-lang-more div.body'));}$('#p-lang').addClass('persistent');}$('#mw-panel > div.portal:first').addClass('first persistent');$('#mw-panel').addClass('collapsible-nav');$('#mw-panel > div.portal:not(.persistent)').each(function(i){var id=$(this).attr('id');var state=$.cookie('vector-nav-'+id);if(state=='true'||(state==null&&i div.portal:not(.persistent) > h5');var tabIndex=$(document).lastTabIndex()+1;$('#searchInput').attr('tabindex',tabIndex++);$headings.each(function(){$(this).attr('tabindex',tabIndex++);});$('#mw-panel').delegate('div.portal:not(.persistent) > h5','keydown',function(event){if(event.which==13||event.which==32){toggle($(this));}}).delegate('div.portal:not(.persistent) > h5','mousedown',function(event){if(event.which!=3){toggle($(this));$(this).blur();}return false;});})(jQuery);;},{"all":
"#mw-panel.collapsible-nav div.portal{background-image:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAIwAAAABCAMAAAA7MLYKAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAEtQTFRF29vb2tra4ODg6urq5OTk4uLi6+vr7e3t7Ozs8PDw5+fn4+Pj4eHh3d3d39/f6Ojo5eXl6enp8fHx8/Pz8vLy7+/v3Nzc2dnZ2NjYnErj7QAAAD1JREFUeNq0wQUBACAMALDj7hf6JyUFGxzEnYhC9GaNPG1xVffGDErk/iCigLl1XV2xM49lfAxEaSM+AQYA9HMKuv4liFQAAAAASUVORK5CYII=);background-image:url(http://www.scholarpedia.org/w/extensions/Vector/modules/images/portal-break.png?2014-12-19T03:13:20Z)!ie;background-position:left top;background-repeat:no-repeat;padding:0.25em 0 !important;margin:-11px 9px 10px 11px}#mw-panel.collapsible-nav div.portal h5{color:#4D4D4D;font-weight:normal;background:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQBAMAAADt3eJSAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAA9QTFRFeXl53d3dmpqasbGx////GU0iEgAAAAV0Uk5T/////wD7tg5TAAAAK0lEQVQI12NwgQIG0hhCDAwMTCJAhqMCA4MiWEoIJABiOCooQhULi5BqMgB2bh4svs8t+QAAAABJRU5ErkJggg==) left center no-repeat;background:url(http://www.scholarpedia.org/w/extensions/Vector/modules/images/open.png?2014-12-19T03:13:20Z) left center no-repeat!ie;padding:4px 0 3px 1.5em;margin-bottom:0px}#mw-panel.collapsible-nav div.collapsed h5{color:#0645AD;background:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAMAAAAoLQ9TAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAAxQTFRF3d3deXl5////nZ2dQA6SoAAAAAN0Uk5T//8A18oNQQAAADNJREFUeNpiYEIDDMQKMKALMDOgCTDCRWACcBG4AEwEIcDITEAFuhnotmC4g4EEzwEEGAADqgHmQSPJKgAAAABJRU5ErkJggg==) left center no-repeat;background:url(http://www.scholarpedia.org/w/extensions/Vector/modules/images/closed-ltr.png?2014-12-19T03:13:20Z) left center no-repeat!ie;margin-bottom:0px}#mw-panel.collapsible-nav div h5:hover{cursor:pointer;text-decoration:none}#mw-panel.collapsible-nav div.collapsed h5:hover{text-decoration:underline}#mw-panel.collapsible-nav div.portal div.body{background:none !important;padding-top:0px;display:none}#mw-panel.collapsible-nav div.persistent div.body{display:block}#mw-panel.collapsible-nav div.first h5{display:none}#mw-panel.collapsible-nav div.persistent h5{background:none !important;padding-left:0.7em;cursor:default}#mw-panel.collapsible-nav div.portal div.body ul li{padding:0.25em 0}#mw-panel.collapsible-nav div.first{background-image:none;margin-top:0px}#mw-panel.collapsible-nav div.persistent div.body{margin-left:0.5em}\n\n/* cache key: wikidb:resourceloader:filter:minify-css:7:da8e8a773eccab12cc615654d05b8845 */\n"
},{"vector-collapsiblenav-more":"More languages"});mw.loader.implement("ext.vector.collapsibleTabs",function($){jQuery(function($){var rtl=$('body').is('.rtl');$.collapsibleTabs.moveToCollapsed=function(ele){var $moving=$(ele);var data=$.collapsibleTabs.getSettings($moving);if(!data){return;}var expContainerSettings=$.collapsibleTabs.getSettings($(data.expandedContainer));if(!expContainerSettings){return;}expContainerSettings.shifting=true;var target=data.collapsedContainer;$moving.css("position","relative").css((rtl?'left':'right'),0).animate({width:'1px'},"normal",function(){$(this).hide();$(' ').insertAfter(this);$(this).detach().prependTo(target).data('collapsibleTabsSettings',data);$(this).attr('style','display:list-item;');var data=$.collapsibleTabs.getSettings($(ele));if(!data){return;}var expContainerSettings=$.collapsibleTabs.getSettings($(data.expandedContainer));if(!expContainerSettings){return;}expContainerSettings.
shifting=false;$.collapsibleTabs.handleResize();});};$.collapsibleTabs.moveToExpanded=function(ele){var $moving=$(ele);var data=$.collapsibleTabs.getSettings($moving);if(!data){return;}var expContainerSettings=$.collapsibleTabs.getSettings($(data.expandedContainer));if(!expContainerSettings){return;}expContainerSettings.shifting=true;var $target=$(data.expandedContainer).find('span.placeholder:first');var expandedWidth=data.expandedWidth;$moving.css("position","relative").css((rtl?'right':'left'),0).css('width','1px');$target.replaceWith($moving.detach().css('width','1px').data('collapsibleTabsSettings',data).animate({width:expandedWidth+"px"},"normal",function(){$(this).attr('style','display:block;');var data=$.collapsibleTabs.getSettings($(this));if(!data){return;}var expContainerSettings=$.collapsibleTabs.getSettings($(data.expandedContainer));if(!expContainerSettings){return;}expContainerSettings.shifting=false;$.collapsibleTabs.handleResize();}));};$('#p-views ul').bind(
'beforeTabCollapse',function(){if($('#p-cactions').css('display')=='none'){$('#p-cactions').addClass('filledPortlet').removeClass('emptyPortlet').find('h5').css('width','1px').animate({'width':'26px'},390);}}).bind('beforeTabExpand',function(){if($('#p-cactions li').length==1){$('#p-cactions h5').animate({'width':'1px'},370,function(){$(this).attr('style','').parent().addClass('emptyPortlet').removeClass('filledPortlet');});}}).collapsibleTabs({expandCondition:function(eleWidth){if(rtl){return($('#right-navigation').position().left+$('#right-navigation').width()+1)$('#left-navigation').position().left;}else{return($('#left-navigation').position().left+$('#left-navigation').width())>$(
'#right-navigation').position().left;}}});});;},{},{});mw.loader.implement("ext.vector.simpleSearch",function($){jQuery(document).ready(function($){if($('#simpleSearch').length==0){return;}var map={'browsers':{'ltr':{'opera':[['>=',9.6]],'docomo':false,'blackberry':false,'ipod':false,'iphone':false},'rtl':{'opera':[['>=',9.6]],'docomo':false,'blackberry':false,'ipod':false,'iphone':false}}};if(!$.client.test(map)){return true;}if(window.os_MWSuggestDisable){window.os_MWSuggestDisable();}$('#simpleSearch > input#searchInput').attr('placeholder',mw.msg('vector-simplesearch-search')).placeholder();$('#searchInput, #searchInput2, #powerSearchText, #searchText').suggestions({fetch:function(query){var $this=$(this);if(query.length!==0){var request=$.ajax({url:mw.util.wikiScript('api'),data:{action:'opensearch',search:query,namespace:0,suggest:''},dataType:'json',success:function(data){if($.isArray(data)&&1 in data){$this.suggestions('suggestions',data[1]);}}});$this.data('request',request);}
},cancel:function(){var request=$(this).data('request');if(request&&$.isFunction(request.abort)){request.abort();$(this).removeData('request');}},result:{select:function($input){$input.closest('form').submit();}},delay:120,positionFromLeft:$('body').hasClass('rtl'),highlightInput:true}).bind('paste cut drop',function(e){$(this).trigger('keypress');});$('#searchInput').suggestions({result:{select:function($input){$input.closest('form').submit();}},special:{render:function(query){if($(this).children().length===0){$(this).show();var $label=$(' ',{'class':'special-label',text:mw.msg('vector-simplesearch-containing')}).appendTo($(this));var $query=$(' ',{'class':'special-query',text:query}).appendTo($(this));$query.autoEllipsis();}else{$(this).find('.special-query').empty().text(query).autoEllipsis();}},select:function($input){$input.closest('form').append($(' ',{type:'hidden',name:'fulltext',val:'1'}));$input.closest('form').submit();}},$region:$('#simpleSearch')}
);});;},{},{"vector-simplesearch-search":"Search","vector-simplesearch-containing":"containing..."});

/* cache key: wikidb:resourceloader:filter:minify-js:7:dac1b333490d761e05aeebe7d78de4c6 */

The most important phase space reconstruction technique is the  method of
delays . Vectors in a new space, the embedding space, are formed from time
delayed values of the scalar measurements:

The number  m  of elements is called the  embedding dimension , the time
  is generally referred to as the  delay  or  lag .  Celebrated
embedding theorems by Takens[ 21 ] and by Sauer et al.[ 22 ]
state that if the sequence   does indeed consist of scalar measurements
of the state of a dynamical system, then under certain genericity assumptions,
the time delay embedding provides a one-to-one image of the original set
 , provided  m  is large enough.

...I think I could get rid of the scripting/coding parts with some hacking, but wanted to bring this issue up here in case it was helpful to know, or in case I'm missing an obvious solution ;-)

Thanks for any help you can provide, and thanks more for making this awesome repo!

AKA

rodricios commented 9 years ago

Hi AKA! I've visited your site once or twice before. Very cool work!

As for the bug you bring up, it will most definitely be addressed in the next major fix. I'm talking new algo, using better features, and a more intuitive explanation as to what goes on!

In any case. I think I'm making a false assumption that this xpath selection will exclude the js later on, in which case the js script (the one you've shown) is a child/descendant of the final "content" node, and gets thrown in with the rest of the extracted text :|

Thanks for bringing this up.