Open nasry opened 7 years ago
Could you try with <heritrix3>
instead of <heritrix>
eg
<settings>
<harvester>
<harvesting>
<heritrix3>
<bundle>/home/test/heritrix3-bundler-5.1.zip</bundle>
...
Hi @bnfklm, yes, I am putting it there! OK, I will try <heritrix3> instead of <heritrix>.
@bnfklm it does not work; I got java.lang.ExceptionInInitializerError.
What's the whole stack?
Are you sure that the start_HarvestController... script is reading the xml settings file? You should have a line in the script like
java -Xmx1024m -Ddk.netarkivet.settings.file=/mypath/mysettings.xml
.....
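A quick way to confirm which settings file a start script passes to the JVM is to search it for the `-Ddk.netarkivet.settings.file=` property. The following is only a minimal sketch: the helper name `settings_files_in_script` is made up for illustration, and the example string reuses the command fragment quoted above.

```python
import re

# Matches the JVM system property that names the settings file.
SETTINGS_FLAG = re.compile(r"-Ddk\.netarkivet\.settings\.file=(\S+)")


def settings_files_in_script(script_text):
    """Return every path passed via -Ddk.netarkivet.settings.file (hypothetical helper)."""
    return SETTINGS_FLAG.findall(script_text)


# Command fragment taken from the comment above; a real script would be read from disk.
example = "java -Xmx1024m -Ddk.netarkivet.settings.file=/mypath/mysettings.xml"
print(settings_files_in_script(example))
```

An empty list would suggest that the script never points the JVM at a settings file.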
@bnfklm here is the error I got:
12:35:35.481 INFO d.n.h.h.HarvestControllerServer - Received crawlrequest for job 8: 'ID:943-127.0.1.1(a0:dd:f7:50:c1:65)-33179-1476790535602: To TEST_COMMON_JOB_PARTIAL_FOCUSED ReplyTo TEST_COMMON_ERROR OK Job: Job 8 (state = SUBMITTED, HD = 7, channel = FOCUSED, snapshot = false, forcemaxcount = -1, forcemaxbytes = 1000000000, forcemaxrunningtime = 0, orderxml = default_orderxml, numconfigs = 1, created = Tue Oct 18 12:35:10 CET 2016, submitted = Tue Oct 18 12:35:35 CET 2016), metadata: [URL= metadata://netarkivet.dk/crawl/setup/duplicatereductionjobs?majorversion=1&minorversion=0&harvestid=7&harvestnum=0&jobid=8 ; mimetype= text/plain ; data= ]' 12:35:35.496 INFO d.n.h.h.HarvestControllerServer - Started harvester thread for job 8 12:35:35.500 INFO d.n.common.distribute.JMSConnection - Removing listener from channel 'TEST_COMMON_JOB_PARTIAL_FOCUSED' 12:35:35.510 INFO d.n.harvester.heritrix3.HarvestJob - Created crawl directory: 'harvester_focused/8_1476790535508' 12:35:35.745 INFO d.n.h.indexserver.FileBasedCache - Metadata cache for 'DEDUP_CRAWL_LOG' uses directory '/home/test/TEST/cache/DEDUP_CRAWL_LOG' 12:35:35.788 INFO d.n.h.i.d.IndexRequestClient - Requesting an index of type 'DEDUP_CRAWL_LOG' for the jobs [] 12:35:36.156 ERROR d.n.h.h.HarvestControllerServer - Fatal error while operating job 'Job 8 (state = SUBMITTED, HD = 7, channel = FOCUSED, snapshot = false, forcemaxcount = -1, forcemaxbytes = 1000000000, forcemaxrunningtime = 0, orderxml = default_orderxml, numconfigs = 1, created = Tue Oct 18 12:35:10 CET 2016, submitted = Tue Oct 18 12:35:35 CET 2016)' dk.netarkivet.common.exceptions.IllegalState: Reply message not ok. 
Message is: 'java.lang.ExceptionInInitializerError java.lang.ExceptionInInitializerError at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:195) at dk.netarkivet.common.utils.SettingsFactory.getInstance(SettingsFactory.java:67) at dk.netarkivet.common.distribute.RemoteFileFactory.getInstance(RemoteFileFactory.java:67) at dk.netarkivet.common.distribute.RemoteFileFactory.getCopyfileInstance(RemoteFileFactory.java:131) at dk.netarkivet.harvester.indexserver.distribute.IndexRequestServer.packageResultFiles(IndexRequestServer.java:446) at dk.netarkivet.harvester.indexserver.distribute.IndexRequestServer.doProcessIndexRequestMessage(IndexRequestServer.java:345) at dk.netarkivet.harvester.indexserver.distribute.IndexRequestServer.access$000(IndexRequestServer.java:76) at dk.netarkivet.harvester.indexserver.distribute.IndexRequestServer$2.run(IndexRequestServer.java:238) Caused by: dk.netarkivet.common.exceptions.UnknownID: No match for key 'settings.common.remoteFile.datatimeout' in settings at dk.netarkivet.common.utils.Settings.get(Settings.java:151) at dk.netarkivet.common.utils.Settings.getInt(Settings.java:163) at dk.netarkivet.common.distribute.FTPRemoteFile.
(FTPRemoteFile.java:70) ... 9 more ' in index request for jobs at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.checkMessageValid(IndexRequestClient.java:320) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:182) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:63) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.indexserver.FileBasedCache.cache(FileBasedCache.java:146) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.indexserver.FileBasedCache.getIndex(FileBasedCache.java:203) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.getIndex(IndexRequestClient.java:63) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.HarvestJob.fetchDeduplicateIndex(HarvestJob.java:228) ~[netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.HarvestJob.writeHarvestFiles(HarvestJob.java:171) ~[netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.HarvestJob.init(HarvestJob.java:85) ~[netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:445) ~[netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] 12:35:36.476 ERROR d.n.common.utils.EMailNotifications - Mailing NetarchiveSuite-ERROR: Fatal error while operating job 'Job 8 (state = SUBMITTED, HD = 7, channel = FOCUSED, snapshot = false, forcemaxcount = -1, 
forcemaxbytes = 1000000000, forcemaxrunningtime = 0, orderxml = default_orderxml, numconfigs = 1, created = Tue Oct 18 12:35:10 CET 2016, submitted = Tue Oct 18 12:35:35 CET 2016)' dk.netarkivet.common.exceptions.IllegalState: Reply message not ok. Message is: 'java.lang.ExceptionInInitializerError java.lang.ExceptionInInitializerError at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:195) at dk.netarkivet.common.utils.SettingsFactory.getInstance(SettingsFactory.java:67) at dk.netarkivet.common.distribute.RemoteFileFactory.getInstance(RemoteFileFactory.java:67) at dk.netarkivet.common.distribute.RemoteFileFactory.getCopyfileInstance(RemoteFileFactory.java:131) at dk.netarkivet.harvester.indexserver.distribute.IndexRequestServer.packageResultFiles(IndexRequestServer.java:446) at dk.netarkivet.harvester.indexserver.distribute.IndexRequestServer.doProcessIndexRequestMessage(IndexRequestServer.java:345) at dk.netarkivet.harvester.indexserver.distribute.IndexRequestServer.access$000(IndexRequestServer.java:76) at dk.netarkivet.harvester.indexserver.distribute.IndexRequestServer$2.run(IndexRequestServer.java:238) Caused by: dk.netarkivet.common.exceptions.UnknownID: No match for key 'settings.common.remoteFile.datatimeout' in settings at dk.netarkivet.common.utils.Settings.get(Settings.java:151) at dk.netarkivet.common.utils.Settings.getInt(Settings.java:163) at dk.netarkivet.common.distribute.FTPRemoteFile. (FTPRemoteFile.java:70) ... 
9 more ' in index request for jobs at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.checkMessageValid(IndexRequestClient.java:320) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:182) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.cacheData(IndexRequestClient.java:63) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.indexserver.FileBasedCache.cache(FileBasedCache.java:146) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.indexserver.FileBasedCache.getIndex(FileBasedCache.java:203) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.indexserver.distribute.IndexRequestClient.getIndex(IndexRequestClient.java:63) ~[harvester-core-5.1.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.HarvestJob.fetchDeduplicateIndex(HarvestJob.java:228) ~[netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.HarvestJob.writeHarvestFiles(HarvestJob.java:171) ~[netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.HarvestJob.init(HarvestJob.java:85) ~[netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:445) ~[netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] 12:35:36.478 INFO d.n.h.h.HarvestControllerServer - Ending crawl of job : 8 12:35:36.482 INFO d.n.h.heritrix3.PostProcessing - Looking for unprocessed crawldata in 'harvester_focused' 12:35:36.483 WARN d.n.h.heritrix3.PostProcessing - Found old 
unprocessed job data in dir '/home/test/TEST/harvester_focused/8_1476790535508'. Crawl probably interrupted by shutdown of HarvestController. Processing data. 12:35:36.613 ERROR d.n.common.utils.EMailNotifications - Mailing NetarchiveSuite-WARNING: Found old unprocessed job data in dir '/home/test/TEST/harvester_focused/8_1476790535508'. Crawl probably interrupted by shutdown of HarvestController. Processing data. 12:35:36.659 INFO d.n.h.heritrix3.PostProcessing - Store files in directory 'harvester_focused/8_1476790535508' from jobID: 8. 12:35:36.660 INFO d.n.h.heritrix3.PostProcessing - Store the files from harvest in 'harvester_focused/8_1476790535508' 12:35:41.796 INFO d.n.h.heritrix3.HarvestDocumentation - Looking for heritrix files in the following directories: /home/test/TEST/harvester_focused/8_1476790535508,/home/test/TEST/harvester_focused/8_1476790535508/heritrix3/jobs/8_1476790535508, /home/test/TEST/harvester_focused/8_1476790535508/heritrix3/jobs/8_1476790535508/latest/reports 12:35:41.819 WARN d.n.h.heritrix3.HarvestDocumentation - The directory /home/test/TEST/harvester_focused/8_1476790535508/heritrix3/jobs/8_1476790535508 does not exist 12:35:41.820 WARN d.n.h.heritrix3.HarvestDocumentation - The directory /home/test/TEST/harvester_focused/8_1476790535508/heritrix3/jobs/8_1476790535508/latest/reports does not exist 12:35:41.844 INFO d.n.h.h.m.MetadataFileWriterWarc - harvester_focused/8_1476790535508/crawler-beans.cxml 28812 12:35:41.853 INFO d.n.h.h.m.MetadataFileWriterWarc - harvester_focused/8_1476790535508/harvestInfo.xml 603 12:35:41.856 INFO d.n.h.h.m.MetadataFileWriterWarc - harvester_focused/8_1476790535508/seeds.txt 24 12:35:41.858 INFO d.n.h.h.m.MetadataFileWriterWarc - harvester_focused/8_1476790535508/archivefiles-report.txt 39 12:35:41.861 WARN d.n.h.heritrix3.HarvestDocumentation - Found no archive directory with ARC og WARC files. 
Looked for dirs '/home/test/TEST/harvester_focused/8_1476790535508/heritrix3/jobs/8_1476790535508/latest/arcs' and '/home/test/TEST/harvester_focused/8_1476790535508/heritrix3/jobs/8_1476790535508/latest/warcs'. 12:35:41.862 WARN d.n.h.heritrix3.PostProcessing - Probable error in Heritrix job setup. No arcfiles or warcfiles generated by Heritrix for job 8 12:35:41.973 ERROR d.n.common.utils.EMailNotifications - Mailing NetarchiveSuite-WARNING: Probable error in Heritrix job setup. No arcfiles or warcfiles generated by Heritrix for job 8 12:35:41.974 WARN d.n.h.heritrix3.PostProcessing - Trouble during postprocessing of files in '/home/test/TEST/harvester_focused/8_1476790535508' dk.netarkivet.common.exceptions.IllegalState: Metadata file /home/test/TEST/harvester_focused/8_1476790535508/metadata/8-metadata-1.warc does not exist at dk.netarkivet.harvester.heritrix3.IngestableFiles.getMetadataArcFiles(IngestableFiles.java:179) ~[netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.PostProcessing.storeFiles(PostProcessing.java:281) [netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.PostProcessing.doPostProcessing(PostProcessing.java:159) [netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.PostProcessing.processOldJobs(PostProcessing.java:124) [netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:466) [netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] 12:35:42.111 ERROR d.n.common.utils.EMailNotifications - Mailing NetarchiveSuite-ERROR: Trouble during postprocessing of files in '/home/test/TEST/harvester_focused/8_1476790535508'. 
Errors accumulated during the postprocessing: Metadata file /home/test/TEST/harvester_focused/8_1476790535508/metadata/8-metadata-1.warc does not exist
dk.netarkivet.common.exceptions.IllegalState: Metadata file /home/test/TEST/harvester_focused/8_1476790535508/metadata/8-metadata-1.warc does not exist at dk.netarkivet.harvester.heritrix3.IngestableFiles.getMetadataArcFiles(IngestableFiles.java:179) ~[netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.PostProcessing.storeFiles(PostProcessing.java:281) [netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.PostProcessing.doPostProcessing(PostProcessing.java:159) [netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.PostProcessing.processOldJobs(PostProcessing.java:124) [netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] at dk.netarkivet.harvester.heritrix3.HarvestControllerServer$HarvesterThread.run(HarvestControllerServer.java:466) [netarchivesuite-heritrix3-controller.jar:cde61d78299cabccae6195908b81ef77c84a76b9] 12:35:42.115 WARN d.n.h.heritrix3.PostProcessing - Job with ID 8 finished with status FAILED
@csrster yes, here is the line in the start_HarvestControllerApplication_focused.sh file:
java -Xmx1536m -Ddk.netarkivet.settings.file=/home/test/TEST/conf/settings_HarvestControllerApplication_focused.xml -Dlogback.configurationFile=/home/test/TEST/conf/logback_HarvestControllerApplication_focused.xml dk.netarkivet.harvester.heritrix3.HarvestControllerApplication < /dev/null > start_HarvestControllerApplication_focused.log 2>&1 &
From what I can see, you are using a remoteFile configuration. I am not using one, so I can't really help you, but I think your configuration file defines something like an FTP server to which your crawled files are transferred.
This configuration needs more parameters, which may be why you get the error about this missing parameter:
settings.common.remoteFile.datatimeout
Here is an example of this configuration.
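To see whether a dotted key such as settings.common.remoteFile.datatimeout is present in a given settings file, one option is to walk the XML element by element. This is a rough sketch, not part of NetarchiveSuite: `has_setting` and `local_name` are hypothetical helpers, the sample XML is invented for the demonstration, and the check looks at a single file only, so it would not see values supplied from anywhere else.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Reduce a namespaced tag like '{uri}name' to 'name'."""
    return tag.rsplit("}", 1)[-1]


def has_setting(xml_text, dotted_key):
    """Return True if dotted_key names an element path in the XML (hypothetical helper)."""
    parts = dotted_key.split(".")
    root = ET.fromstring(xml_text)
    if local_name(root.tag) != parts[0]:
        return False
    nodes = [root]
    for name in parts[1:]:
        # Keep only the children whose tag matches the next key segment.
        nodes = [child for node in nodes for child in node
                 if local_name(child.tag) == name]
        if not nodes:
            return False
    return True


# Invented sample: a remoteFile section with no datatimeout element.
sample = "<settings><common><remoteFile/></common></settings>"
print(has_setting(sample, "settings.common.remoteFile"))
print(has_setting(sample, "settings.common.remoteFile.datatimeout"))
```

The same check could be pointed at "settings.harvester.harvesting.heritrix3.bundle" to confirm that the bundle setting sits under the expected element.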
@bnfklm do you use HTTP or HTTPS instead of FTP? I deployed NAS with an old deploy XML file in which the jar names look like (dk.netarckive.), and then I manually changed the jar names in the various .sh files. I suspect that is what is causing my problems! Can you give me the deploy XML file that works for you, so that I can adapt it to my architecture? My architecture consists of six machines: one admin machine, one server machine, two harvester machines, and two bitarchive machines. Please, I need help: I have spent too much time on this and I need to deploy NAS as soon as possible. Here is the deploy file I used: https://github.com/nasry/parcours_etudiants/blob/master/deploy.xml
Please check this issue to better understand what I am talking about: https://github.com/netarchivesuite/netarchivesuite/issues/19
I am grateful for your help.
We write files to the local file system instead of a bitarchive, so my configuration won't help you much for this part, but here is my local configuration (one machine for everything).
This is what we use to write archive files to the local file system:
<arcrepositoryClient>
<class>dk.netarkivet.common.distribute.arcrepository.LocalArcRepositoryClient</class>
<fileDir>/home/ccm/ccm_klm/nas_repository/arcs</fileDir>
</arcrepositoryClient>
If it helps, there are other configurations here.
Created an internal tracking issue for this thread: https://sbprojects.statsbiblioteket.dk/jira/browse/NARK-1530
Hello, when I run a new job, I get this error in the HarvestControllerApplication_focused.log file:
Note that I am running NAS in a distributed environment using this command: "./RunNetarchiveSuite.sh distribution-5.1.zip deploy_distributed_example.xml deploy heritrix3-bundler-5.1.zip".
I have added the setting <bundle>/home/test/heritrix3-bundler-5.1.zip</bundle> to the settings_HarvestControllerApplication_focused.xml file under "settings > harvester > harvesting > heritrix", but it is still not working!
Thanks for your help.