Metadata cannot be retrieved from the CNKI DOI page for some papers

pixiandouban commented 2 years ago

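For some papers, after the DOI redirects to the CNKI paper page, Zotero fails to recognize the page as CNKI. It then falls back to DOI detection, which cannot retrieve the metadata.

At present, Zotero cannot retrieve metadata via a CNKI paper's DOI.

Example: DOI address -> original address. That is, metadata cannot be retrieved from the first URL.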

Lemmingh commented 2 years ago

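Please consider describing the problem like this:

CNKI pages reached via a DOI redirect are not recognized

For example, 10.13366/j.dik.2016.06.062 leads to https://www.cnki.net/kcms/doi/10.13366/j.dik.2016.06.062.html.

However, such pages currently cannot be recognized.

Let me grumble first:

CNKI's URLs are pure chaos; every imaginable form exists.

If you only need to handle single-document pages (文献知网节), you could change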

https://github.com/l0o0/translators_CN/blob/b2e121fa13c63c3104c1c2c71575096fe64e3261/translators/CNKI.js#L5

to

"^https?://([a-z0-9-]+\\.)?cnki\\.net/kcms/"
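A quick way to check the proposed pattern is to run it against a few URLs. The sketch below is only an illustration: the DOI landing page comes from this thread, while the detail-page URL (with its `dbcode` and `filename` values) and the `example.com` URL are made-up test inputs.

```typescript
// The proposed target string, copied as-is from the comment above.
const proposedTarget = "^https?://([a-z0-9-]+\\.)?cnki\\.net/kcms/";
const cnkiKcmsPattern = new RegExp(proposedTarget);

// DOI landing page mentioned earlier in this thread.
const doiLandingPage = "https://www.cnki.net/kcms/doi/10.13366/j.dik.2016.06.062.html";
// Hypothetical detail page; the parameter values are placeholders.
const detailPage = "https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&filename=EXAMPLE";
// Hypothetical non-CNKI host that merely shares the same path.
const unrelatedPage = "https://example.com/kcms/doi/10.13366/j.dik.2016.06.062.html";

console.log(cnkiKcmsPattern.test(doiLandingPage)); // true
console.log(cnkiKcmsPattern.test(detailPage));     // true
console.log(cnkiKcmsPattern.test(unrelatedPage));  // false
```

Note that the optional group matches at most one subdomain label, so a host with two labels in front of `cnki.net` would not match this pattern.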

pixiandouban commented 2 years ago

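@Lemmingh @l0o0 While we are on CNKI URLs, a suggestion: the item's url could be normalized to https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=$dbcode&filename=$filename (or with overseas prepended for the overseas edition). The dbname, v, and uniplatform parameters in the URL are not needed (and I don't know what they are for).

Sometimes, when a paper is opened from the journal's home page, the URL does not show dbcode and filename, so these two parameters have to be filled in manually. For example, for the paper mentioned above, compare the long link (with the v parameter) VS the short link.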
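As a rough illustration of the suggestion above, the following sketch reduces a long detail URL to one carrying only `dbcode` and `filename`. The function name `simplifyCnkiUrl` and the sample URL with its parameter values are hypothetical; whether CNKI accepts the shortened form for every database is not verified here.

```typescript
/**
 * Keeps only `dbcode` and `filename` from a CNKI detail URL.
 * Returns `undefined` when either parameter is missing, because the
 * short form cannot be built without both.
 */
const simplifyCnkiUrl = (input: string): string | undefined => {
    const source = new URL(input);

    // Collect parameters under lower-case keys so that `DbCode` and `dbcode` are treated alike.
    const params = new Map<string, string>();
    source.searchParams.forEach((value, key) => params.set(key.toLowerCase(), value));

    const dbCode = params.get("dbcode");
    const fileName = params.get("filename");
    if (!dbCode || !fileName) {
        return undefined;
    }

    const result = new URL("https://kns.cnki.net/kcms/detail/detail.aspx");
    result.searchParams.set("dbcode", dbCode);
    result.searchParams.set("filename", fileName);
    return result.href;
};

// Hypothetical long link; every parameter value is a placeholder.
const longLink =
    "https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&dbname=CJFDTOTAL&filename=EXAMPLE&v=abc&uniplatform=NZKPT";
console.log(simplifyCnkiUrl(longLink));
```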

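Also, regarding Wanfang Data links: when a paper is opened from the search page now, the URL has a long string after periodical. That string can be replaced with the paper ID (which can usually be found on the page, and is the equivalent of CNKI's filename).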

Lemmingh commented 2 years ago

Like this?

/**
 * Normalizes a 文献知网节 URL.
 *
 * @returns The URL will begin with `https://kns.cnki.net/kcms/detail/detail.aspx?`.
 */
const normalizeUrl = (info: Cnki.Params): URL => {
    const result = new URL("https://kns.cnki.net/kcms/detail/detail.aspx");

    // A URL must have either `dbcode` or `dbname`, otherwise, CNKI will not accept it.
    // CNKI is case-insensitive. However, parameters are conventionally lower case, and values are upper case.
    result.searchParams.set("dbcode", info.dbCode.toUpperCase());
    result.searchParams.set("dbname", info.dbCode.toUpperCase() + "TOTAL");
    result.searchParams.set("filename", info.fileName.toUpperCase());

    return result;
};

pixiandouban commented 2 years ago

@Lemmingh Yes.

Lemmingh commented 2 years ago

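Sometimes, when a paper is opened from the journal's home page, the URL does not show dbcode and filename

Yes, CNKI URLs come in many forms.

This is how I handle it myself: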

/**
 * CNKI DB code (upper case) -> Zotero item type.
 */
const Db_Type_Map = Object.freeze<Cnki.DbTypeMap>({
    CJFQ: "journalArticle",
    CJFD: "journalArticle",
    CAPJ: "journalArticle",
    SJES: "journalArticle",
    SJPD: "journalArticle",
    SSJD: "journalArticle",
    CCJD: "journalArticle",
    CDMD: "journalArticle",
    CYFD: "journalArticle",
    CDFD: "thesis",
    CMFD: "thesis",
    CLKM: "thesis",
    CCND: "newspaperArticle",
    CPFD: "conferencePaper",
    IPFD: "conferencePaper",
    SCPD: "patent",
});

/**
 * Checks if the code is a known DB code.
 */
const validateDbCode = (code: string): code is Cnki.DbCode => {
    return Object.prototype.hasOwnProperty.call(Db_Type_Map, code);
};

/**
 * Gets basic info of a single 文献知网节 page.
 *
 * Be fast! `detectWeb()` calls this function.
 */
const getSinglePageInfo = (doc: Document): Cnki.SinglePageInfo | undefined => {
    const params = new Map(
        Array.from(doc.querySelectorAll<HTMLInputElement>('input[id^="param"]'), (e) => [e.id.slice(5), e.value])
    );

    // These parameters should always exist.
    const dbCode = params.get("dbcode")!.toUpperCase();
    const dbName = params.get("dbname")!.toUpperCase();
    const fileName = params.get("filename")!.toUpperCase();

    const dbCode2 = dbName.slice(0, 4);
    const code = validateDbCode(dbCode2) ? dbCode2 : validateDbCode(dbCode) ? dbCode : undefined;

    // Return `undefined` for unknown code.
    return code ? { dbCode: code, dbName, fileName, itemType: Db_Type_Map[code] } : undefined;
};

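Wanfang Data links: when a paper is opened from the search page now, the URL has a long string after periodical

That long string in Wanfang URLs is called the uid. Base64-decoding it yields 3 segments; the middle one is the id.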
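The decoding step could look roughly like the sketch below. The thread does not say which character separates the 3 segments, so the separator is passed in as a parameter; the `|` used in the demo, the sample uid, and the function name `extractWanfangId` are all made up for illustration and do not reflect a real Wanfang uid.

```typescript
/**
 * Base64-decodes a Wanfang uid and returns the middle of its 3 segments.
 * The separator is an assumption supplied by the caller.
 * Returns `undefined` when the decoded text does not split into exactly 3 parts.
 * Note: `atob()` throws if the input is not valid Base64.
 */
const extractWanfangId = (uid: string, separator: string): string | undefined => {
    const decoded = atob(uid);
    const segments = decoded.split(separator);
    return segments.length === 3 ? segments[1] : undefined;
};

// Synthetic uid built for this demo only.
const sampleUid = btoa("prefix|sample-id|suffix");
console.log(extractWanfangId(sampleUid, "|")); // prints "sample-id"
```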

pixiandouban commented 2 years ago

Thanks. A few lines of the code are a bit complex for me; I need to study them more.

Lemmingh commented 2 years ago

That is TypeScript. You can compile it with esbuild to see the output.

Lemmingh commented 2 years ago

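The dbname parameter in the URL: I don't know what it is for

It is probably there for table partitioning, to improve query performance.

Following that reasoning, the dbname line in normalizeUrl() should use info.dbName, to avoid putting too much load on CNKI.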
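With that change applied, the function might look like the sketch below. The `PageParams` interface is a stand-in for the `Cnki.Params` type, whose definition is not shown in this thread, and the function is renamed `normalizeUrlWithDbName` here only to keep it distinct from the earlier version. The sample values are placeholders.

```typescript
// Stand-in for `Cnki.Params`; the field names follow the code earlier in the thread.
interface PageParams {
    dbCode: string;
    dbName: string;
    fileName: string;
}

/**
 * Normalizes a 文献知网节 URL, passing the page's own `dbname` through
 * instead of deriving it from `dbcode`.
 */
const normalizeUrlWithDbName = (info: PageParams): URL => {
    const result = new URL("https://kns.cnki.net/kcms/detail/detail.aspx");

    result.searchParams.set("dbcode", info.dbCode.toUpperCase());
    // Use the dbname reported by the page rather than `dbCode + "TOTAL"`.
    result.searchParams.set("dbname", info.dbName.toUpperCase());
    result.searchParams.set("filename", info.fileName.toUpperCase());

    return result;
};

// Placeholder values, not taken from a real CNKI page.
const sampleUrl = normalizeUrlWithDbName({ dbCode: "cjfd", dbName: "cjfdlast2016", fileName: "example" });
console.log(sampleUrl.href);
```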

l0o0 / translators_CN

Metadata cannot be retrieved from the CNKI DOI page for some papers #61