Closed ianmilligan1 closed 8 years ago
Actually, I'm not convinced that loadArc is working either. @aliceranzhou, can you share the scripts that you used for this with me? Thanks. I may have missed an API update.
Hmm, looking.
On Thu, Dec 3, 2015 at 4:44 AM Ian Milligan notifications@github.com wrote:
Actually, I'm not convinced that the loadArc is working either. @aliceranzhou https://github.com/aliceranzhou, can you share me the scripts that you used for this? Thanks. I may have missed an API update.
— Reply to this email directly or view it on GitHub https://github.com/lintool/warcbase/issues/185#issuecomment-161569727.
Thanks! This is on the York Rho server, BTW.
Thanks. I've reproduced the WARC error. It seems to be an issue with getContentString on the WARCRecord. Still looking.
For ARC, I used the full URL when I should have used the domain. Perhaps there are fewer than 10 links from URLs per day. I've updated the script here https://github.com/lintool/warcbase/wiki/Spark:-Analysis-of-Site-Link-Structure to use domains instead, and also lowered the count filter to 5.
Thanks, @aliceranzhou. Glad to hear you could reproduce it.
Re-running on the ARC dataset, I'll keep you posted.
Great, thanks!
I think I found the issue. Try adding import org.warcbase.spark.matchbox.RecordTransformers._ to the imports.
Actually, it needs to be included in the RecordLoader. I'll push a fix.
^ That wasn't the issue. The record content can't be read twice, so I changed the functions to variables that are set only once.
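For anyone hitting the same symptom, here is a minimal sketch of the read-once problem described above. It is not the warcbase code: ReadOnceSketch, RecordWithDef, RecordWithVal, drain and stream are hypothetical names used only for illustration, and a ByteArrayInputStream stands in for the record's payload stream.

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Hypothetical illustration only; none of these names come from warcbase.
object ReadOnceSketch {
  // Reads every remaining byte. On an already-consumed stream this returns an empty array.
  def drain(in: InputStream): Array[Byte] =
    Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray

  // Problematic shape: every call re-reads the same one-shot stream.
  class RecordWithDef(in: InputStream) {
    def contentBytes: Array[Byte] = drain(in)
    def contentString: String = new String(contentBytes, "UTF-8")
  }

  // Fixed shape: the stream is drained once at construction and the result is kept.
  class RecordWithVal(in: InputStream) {
    val contentBytes: Array[Byte] = drain(in)
    val contentString: String = new String(contentBytes, "UTF-8")
  }

  // Helper that builds a fresh in-memory stream for the demo.
  def stream(s: String): InputStream = new ByteArrayInputStream(s.getBytes("UTF-8"))
}
```

With the def version, the first call to contentString returns the payload and any later call returns an empty string, which would be consistent with a job that runs but writes an empty output file. With the val version, repeated access returns the same cached content.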
Merged to master now, please let me know if there are any issues.
OK, testing now. Will let you know in a few hours!
Closing... Ian, please re-open if you still find bugs...
Should have responded earlier: it is working perfectly. FYI, we've uploaded the derivative dataset here.
I've been running tests extracting all links from a large collection. The following script works:
However, this script crashes:
The error dump is:
Producing an empty part-00000 file. Any idea?