lmjohns3 / theanets

Neural network toolkit for Python
http://theanets.rtfd.org
MIT License
328 stars 74 forks source link

Error in Quick Start: Recurrent Models #133

Open saddy001 opened 7 years ago

saddy001 commented 7 years ago

Hi,

I think there's something wrong with the quick start example. I see rising accuracy but no real words:

used for light, but only as an oi|ma;loaob1lu oh eoobol g oop"ebaoiple (55.3%)
downhill: RMSProp 170 loss=1.442173 err=1.442173 acc=0.552059
used for light, but only as an oi| luaiabafeoeflrbabnoahaao hreokbhhiaba e (55.2%)
downhill: validation 17 loss=1.446977 err=1.446977 acc=0.550113 *
downhill: RMSProp 171 loss=1.440421 err=1.440421 acc=0.552269
used for light, but only as an oi|iaoi -h.rbasop,lbea htpl?cbhiaaeb3eonylb (55.2%)
downhill: RMSProp 172 loss=1.439116 err=1.439116 acc=0.553969
used for light, but only as an oi|eaa.epatiboh,r? tbo rh ouoif;efetfeiu i (55.4%)
downhill: RMSProp 173 loss=1.443268 err=1.443268 acc=0.551297
used for light, but only as an oi|agoeea,eoswino-oaait oateerfaeraoeoa o (55.1%)
downhill: RMSProp 174 loss=1.438152 err=1.438152 acc=0.553594
used for light, but only as an oi|oa ghoia.aaaa0 am e b ,sbct;aoaoabo epa, (55.4%)
downhill: RMSProp 175 loss=1.432567 err=1.432567 acc=0.554312
used for light, but only as an oi|ofmaha,;orhooaaapebeohi!-e.hca pih mwhh (55.4%)
downhill: RMSProp 176 loss=1.433838 err=1.433838 acc=0.554353
used for light, but only as an oi|l ray a aiaal'a.btaea-ataaomhbabr,esal (55.4%)
downhill: RMSProp 177 loss=1.435675 err=1.435675 acc=0.553609
used for light, but only as an oi|;apb,e3eeibios ,aysta- ,;re ooadielai es (55.4%)
downhill: RMSProp 178 loss=1.439022 err=1.439022 acc=0.552759
used for light, but only as an oi|etnatio ail,b ; ulo pblh e,aboo,yibeey. (55.3%)
downhill: RMSProp 179 loss=1.432462 err=1.432462 acc=0.553709
used for light, but only as an oi|oeooneay noia.eaaoaioaeho.b ababb lsebna (55.4%)

In the documented example, meaningful words already emerge at around 50% accuracy, but I see none here. I had to make two small changes to run the example. First, the corpus is now served compressed by Gutenberg, so I had to decompress it. Second, I had to change seed = txt.encode(txt.text[300017:300050]) to seed = txt.encode(txt.text[300015:300048]) to get the same sentence seed.
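For what it's worth, the two slices select windows of the same length, just shifted by two characters; a quick Python check of the indices reported above:

```python
# Seed slice from the docs vs. the adjusted slice reported here.
old_start, old_stop = 300017, 300050  # index in the quick start docs
new_start, new_stop = 300015, 300048  # index that works after decompressing

# Both slices cover the same number of characters...
assert old_stop - old_start == new_stop - new_start == 33

# ...and the working window simply starts two characters earlier,
# consistent with a small offset in the decompressed corpus text.
print(old_start - new_start)  # -> 2
```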

lmjohns3 commented 7 years ago

Interesting, thanks for the report. I'll try to look into it in the next couple weeks. Please feel free to send a PR to fix up the compression and indexing issues if you like.

saddy001 commented 7 years ago

In the docs,
curl http://www.gutenberg.org/cache/epub/2701/pg2701.txt > corpus.txt
should be
curl http://www.gutenberg.org/cache/epub/2701/pg2701.txt | gunzip -c > corpus.txt
The corrected seed index can be seen in my comment above.
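The decompression step that the corrected command performs can also be sketched in Python; this is a local illustration using a tiny gzipped sample in place of the actual Gutenberg download:

```python
import gzip

# Stand-in for the gzip-compressed response the server now returns.
compressed = gzip.compress(b"Call me Ishmael.")

# Equivalent of piping the download through `gunzip -c`:
corpus = gzip.decompress(compressed).decode("utf-8")
print(corpus)  # -> Call me Ishmael.
```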