matthias-samwald / find-me-evidence

An open-source medical search engine
GNU Affero General Public License v3.0

Crawling: In doc2doc the same content appears multiple times under slightly different URIs #11

Open matthias-samwald opened 10 years ago

matthias-samwald commented 10 years ago

See http://doc2doc.bmj.com/forums/off-duty_general_gordon-should-stuck-his-guns?plckFindPostKey=Cat:OffDutyForum:GeneralDiscussion:7860c5a4-ad9e-4897-9641-d89ab42c1407Post:7860c5a4-ad9e-4897-9641-d89ab42c1407

http://doc2doc.bmj.com/forums/off-duty_general_gordon-should-stuck-his-guns?plckFindPostKey=Cat:OffDutyForum:GeneralDiscussion:7860c5a4-ad9e-4897-9641-d89ab42c1407Post:1ce63bb4-a495-4515-a6da-39f15c96e84a

http://doc2doc.bmj.com/forums/off-duty_general_gordon-should-stuck-his-guns?plckFindPostKey=Cat:OffDutyForum:GeneralDiscussion:7860c5a4-ad9e-4897-9641-d89ab42c1407Post:3519ae78-57e0-4a09-a3d5-2eed3bc4aa6c

These URLs point to different subsections (individual posts) of the same web page, so the crawler picks up the same content several times. For the discussion thread in this example, only the following URLs should be included: http://doc2doc.bmj.com/forums/off-duty_general_gordon-should-stuck-his-guns_.0 and http://doc2doc.bmj.com/forums/off-duty_general_gordon-should-stuck-his-guns_.1
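
One possible way to handle this, sketched in Python: skip any URL that carries the `plckFindPostKey` query parameter, since in the three examples above that parameter is the only thing distinguishing the per-post links from the thread page. This is only a sketch under that assumption. The function names `is_post_permalink` and `filter_thread_urls` are hypothetical and not part of the existing crawler, and other doc2doc URL patterns may need separate handling.

```python
from urllib.parse import urlsplit, parse_qs

# Assumption: this query parameter marks a link to an individual post
# inside a thread page, as seen in the example URLs of this issue.
POST_KEY_PARAM = "plckFindPostKey"


def is_post_permalink(url: str) -> bool:
    """Return True if the URL points at a single post within a thread page."""
    query = urlsplit(url).query
    return POST_KEY_PARAM in parse_qs(query, keep_blank_values=True)


def filter_thread_urls(urls):
    """Drop per-post permalinks and exact duplicates, preserving order."""
    seen = set()
    kept = []
    for url in urls:
        if is_post_permalink(url) or url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept
```

With this filter, the three `plckFindPostKey` URLs above would be dropped, while the paginated thread URLs ending in `_.0` and `_.1` would be kept, provided the crawler discovers them separately.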