langchain-ai / langchainjs

πŸ¦œπŸ”— Build context-aware reasoning applications πŸ¦œπŸ”—
https://js.langchain.com/docs/
MIT License
12.44k stars 2.11k forks source link

UnstructuredURLLoader feature is not availble #2251

Closed apurvjha123 closed 10 months ago

apurvjha123 commented 1 year ago

Due to this issue I have to use cherrio, html-to-text which is given in documentation , there are some cons of this. When I asked RetrivalQA then It return css properties, URLs and javascript as well , because they are involved init.

Can please some one suggest what is the alternate of ### UnstructuredURLLoader in javascript langchain Please .****

dosubot[bot] commented 1 year ago

πŸ€–

Hello,

Based on the context provided, it seems you're looking for an alternative to the UnstructuredURLLoader feature in the Javascript version of LangChain. The equivalent feature in the Javascript version of LangChain is the UnstructuredLoader class. This class is designed to load unstructured data from various file types and convert them into a structured format that can be used by the LangChain application.

Here is an example of how you can use the UnstructuredLoader class:

import * as path from "node:path";
import { UnstructuredLoader } from "langchain/document_loaders/fs/unstructured";

const filePath = path.resolve(
  path.dirname(url.fileURLToPath(import.meta.url)),
  "./example_data/example.txt"
);

const options = {
  apiKey: "MY_API_KEY",
};

const loader = new UnstructuredLoader(filePath, options);
const docs = await loader.load();

In this example, the UnstructuredLoader class is used to load a file from the specified path. The load method is then called to convert the unstructured data into a structured format. The resulting documents are then stored in the docs variable.

As for handling CSS properties, URLs, and Javascript when loading unstructured data, I wasn't able to find specific information on how the UnstructuredURLLoader handles these elements. However, it's worth noting that the purpose of the UnstructuredLoader is to convert unstructured data into a structured format that can be used by the LangChain application. Therefore, it's likely that these elements are either ignored or processed in a way that allows them to be included in the structured data.

If you're experiencing issues with these elements being included in your structured data, you might need to implement additional processing steps to remove or handle these elements as needed. This could involve using a library like cheerio or html-to-text to parse the HTML and remove unwanted elements, as you mentioned in your question.

I hope this helps! If you have any further questions or need more clarification, please let me know.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a πŸ‘ if this is helpful and πŸ‘Ž if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

apurvjha123 commented 1 year ago

the data diffrence between python and js is too diffrent 'using python'

[Document(page_content="Explore STCET\n\nSt. Thomas' College of Engineering & Technlogy\n\nAICTE Approved, MAKAUT affiliated, with NBA Accredited Programmes\n\nApproved By AICTE\n\nAffiliated to MAKAUT, WB\n\nNBA Accredited Programmes Available\n\nAbout Us\n\nVision\n\nTo evolve as an industry oriented, research based Institution for creative solutions in various engineering domains, with an ultimate objective of meeting technological challenges faced by the Nation and the Society.

and when i use js the data comes something like this:-

using html-to-text :-

'Document { pageContent: ' Skip to contentOnline FeesNIRFCareer in STCETOnline FeesNIRFCareer in STCET/! elementor - v3.15.0 - 02-08-2023 /\n' + '.elementor-widget-image{text-align:center}.elementor-widget-image a{display:inline-block}.elementor-widget-image a img[src$=".svg"]{width:48px}.elementor-widget-image img{vertical-align:middle;display:inline-block} /! elementor - v3.15.0 - 02-08-2023 /\n' + ".elementor-heading-title{padding:0;margin:0;line-height:1}.elementor-widget-heading .elementor-heading-title[class*=elementor-size-]>a{color:inherit;font-size:inherit;line-height:inherit}.elementor-widget-heading .elementor-heading-title.elementor-size-small{font-size:15px}.elementor-widget-heading .elementor-heading-title.elementor-size-medium{font-size:19px}.elementor-widget-heading .elementor-heading-title.elementor-size-large{font-size:29px}.elementor-widget-heading .elementor-heading-title.elementor-size-xl{font-size:39px}.elementor-widget-heading .elementor-heading-title.elementor-size-xxl{font-size:59px}St. Thomas' Collegeof Engineering & Technology B.Tech / M.Tech admissions 2023-2024 - 8017993801 Online FeesNIRFCareer in STCETOnline FeesNIRFCareer in STCET Menu Home About Us Achievements Placements Departments Facilities Admission Contact Us Home About Us Brief HistoryCollege TrustGovernanceOrganizational ChartAcademic AdministrationInstitute Cells & ClubsAcademic PerformanceRules & RegulationsCollege TimingMiscellaneous Information College Brochure Download Achievements University Rank HoldersCollege ToppersSuccess Stories of GATESummer Internship ProgrammePublicationsHigher StudiesMOOCS Placements Placement OverviewVisiting Companies Departments Computer Science and EngineeringUG CoursesComputer Science and EngineeringArtificial Intelligence & Machine LearningPG CoursesArtificial Intelligence & Data ScienceElectronics & Communication EngineeringUG CoursesElectronics & Communication EngineeringPG CoursesInternet of ThingsInformation TechnologyUG CoursesInformation TechnologyElectrical EngineeringUG CoursesElectrical EngineeringBasic Science & HumanitiesFoundation CoursesBasic Science & Humanities Facilities Central Computer CenterCentral LibraryInternet & IntranetLanguage Practice LaboratoryRecreation Admission Admission NoticeApproval, Affiliation & AccreditationAdmission ProcedureIntake CapacityFee StructureScholarships Contact Us St. Thomas' Collegeof Engineering & Technology .elementor-nav-menu__container{\n" +
'top:0px!important;\n' + '}\n' +'

I want data like python in javascript .

dosubot[bot] commented 11 months ago

Hi, @apurvjha123! I'm Dosu, and I'm helping the langchainjs team manage our backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the issue you raised was about the UnstructuredURLLoader feature not being available in the JavaScript version of LangChain. In a comment, I suggested using the UnstructuredLoader class as an alternative and even provided an example of how to use it. Additionally, it seems that your concern about the difference in data between Python and JavaScript has been addressed.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the langchainjs repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to langchainjs!

Greasen commented 6 months ago

UnstructuredLoader εŠŸθƒ½δΎη„ΆδΈε―η”¨οΌŒεœ¨jsη‰ˆζœ¬δΈ­γ€‚ζη€Ίι”™θ――